Probe Design Methods and Microarrays for Comparative Genomic Hybridization and Location Analysis

ABSTRACT

Methods and systems for identifying and selecting nucleic acid probes for detecting a target with a nucleic acid probe array or comparative genome hybridization microarray, comprising selecting a plurality of potential target sequences, generating a plurality of candidate probes from the target sequences, and filtering the plurality of candidate probes by analyzing the candidate probes for selected probe properties in silico. Microarrays comprising probes selected by the methods of the invention are particularly useful for comparative genome hybridization and location analysis.

FIELD OF THE INVENTION

The invention relates to methods for designing and selecting probes for microarrays, and in particular for comparative genome hybridization arrays and for location analysis.

BACKGROUND OF THE INVENTION

Comparative Genomic Hybridization (CGH) and location analysis are important applications, which allow scientists to improve their understanding of the expression and regulation of genes in biological systems. Both CGH and location analysis entail quantifying or measuring changes in copy number of genomic sequences. CGH is particularly important in developmental biology and in studying the causes of cancer, and offers great potential in the diagnosis of cancer and developmental diseases. Recently, cDNA microarrays have been used for CGH studies. An oligo-array based approach has several substantial advantages over other technologies, in that it allows the designer to position the probes anywhere within the genomic or polynucleotide sequence of interest. The probes can be placed at whatever density is commensurate with the real-estate or area available on the microarray (in terms of number of features), and the genomic regions of interest can be evaluated by analyzing the hybridization of target sequences to the surface-bound probes. The oligonucleotide probe approach also offers the flexibility of focusing on regions within exons or introns of expressed sequences, or on intergenic regions and regulatory regions for location analysis, as well as any desirable admixture of the aforementioned.

Probes that work well on microarrays for gene expression generally do not work well for CGH arrays and are not appropriate for location analysis arrays. Probes for CGH and location analysis arrays require their properties to be optimized differently from probes utilized for gene expression. Most notably, the labeled target mixture is substantially more complex for CGH and location analysis than for expression analysis, which demands a greater specificity of the probes in discriminating against non-specific binding to competing targets. For comparison, the total number of nucleotide bases in the human transcriptome is approximately 10⁸, while the human genome contains over 3×10⁹ bases. Additionally, probes selected for gene expression come from within message sequences that are transcribed as RNA, i.e. exons, while probes for CGH need to be complementary, or nearly so, to contiguous targets selected from within a genome sequence, e.g. introns and/or exons.

With increased target complexity comes increased flexibility in the choice of probes. For example, many methods for gene expression restrict probe design to several hundred bases of the 3′-end of the target (message) sequence. This limits the probe designer to a choice of one in about 500-1000 discrete positions where a probe can be started within any given gene (or transcript). However, for CGH probe design, scientists have a much broader region in which to choose a probe for any given gene. This region may include introns as well as exons and is typically hundreds of thousands of bases long, and in some cases even millions of bases in length.

For location analysis probe design, scientists have a specific region in which to identify and design probes. While the probe designer is constrained to selecting probes within regulatory regions, regions upstream of genes and/or specific locations of interest, the overall number of bases which must be screened is much larger and broader than the region analyzed for gene expression probe design.

Despite great interest in CGH technology, methods for evaluating probes in silico and also empirically for use in this technology are limited. A rigorous method would be to measure signals (e.g. ratios) from each polynucleotide in controlled experiments with test samples containing known copy numbers for each sequence on the array. For example, a method used by several probe designers for measuring array performance for sets of polynucleotides specific for sequences on the X chromosome is to use a series of cell lines with known variable copies of the X chromosome for CGH experiments. These cell lines (X series) contain intact copies (e.g. 1 to 5) of the X chromosome, permitting a rigorous measure of the relationship between copy number and signal intensities for each X chromosome specific polynucleotide on an array.

However, cell lines containing known variable numbers of intact copies of chromosomes other than the X chromosome are not readily available. Furthermore, the aberrant X series cell lines are slow growing and can spontaneously vary in ploidy under standard culturing conditions. Such methods are complex and time-consuming and cannot readily be used to assay the relationship between the hybridization signal of polynucleotides on an array and the genomic copy number of sequences from each chromosome in a cell.

Accordingly, a great need exists for methods for designing and evaluating surface-bound CGH probe nucleic acids (i.e. probes) as well as microarrays comprising these probes which have been identified to have probe properties which make them well suited for CGH and location analysis. This invention meets this, and other, needs.

Relevant Literature

United States patents of interest include: U.S. Pat. Nos. 6,465,182; 6,335,167; 6,251,601; 6,210,878; 6,197,501; 6,159,685; 5,965,362; 5,830,645; 5,665,549; 5,447,841 and 5,348,855. Also of interest are published United States Application Serial No. 2002/0006622 and published PCT application WO 99/23256. Articles of interest include: Pollack et al., Proc. Natl. Acad. Sci. (2002) 99: 12963-12968; Wilhelm et al., Cancer Res. (2002) 62: 957-960; Pinkel et al., Nat. Genet. (1998) 20: 207-211; Cai et al., Nat. Biotech. (2002) 20: 393-396; Snijders et al., Nat. Genet. (2001) 29: 263-264; Hodgson et al., Nat. Genet. (2001) 29: 459-464; Trask, Nat. Rev. Genet. (2002) 3: 769-778; Rabinovitch et al., Cancer Res. (1999) 59: 5148-5153; Lee et al., Human Genet. (1997) 100: 291-304; Conlon et al., PNAS (2003) 100: 3339-3344; Trinklein et al., Genome Res. (2003) 308-312; Breslauer et al., Proc. Natl. Acad. Sci. (1986) 83(11): 3746-3750; Sugimoto et al., Nucleic Acids Res. (1996) 24: 4505.

SUMMARY OF THE INVENTION

Methods for designing and identifying probes for array based measurements of genomic copy number for comparative genomic hybridization and location analysis are provided. Specifically, a method is provided for generating candidate probes from a target sequence or genomic sequence of interest, comprising repeat-masking the target sequence to form non-repeat masked regions, and tiling, i.e. generating a periodic set of sequences across the non-repeat masked regions, to generate the candidate probes.
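The repeat-masking and tiling steps can be illustrated with a minimal Python sketch. It assumes a soft-masked input in which repeats appear as lowercase letters or N (a common convention, not one stated here), and the function name tile_probes, the probe length and the step size are illustrative choices only.

```python
import re

def tile_probes(sequence, probe_len=60, step=30):
    """Tile fixed-length candidate probes across unmasked regions.

    Assumption: repeat-masked bases are lowercase or 'N', so an unmasked
    region is any maximal run of uppercase A, C, G or T.
    Returns a list of (start_position, probe_sequence) tuples.
    """
    probes = []
    for region in re.finditer(r"[ACGT]+", sequence):
        # Last valid start keeps the whole probe inside the unmasked region.
        last_start = region.end() - probe_len
        for start in range(region.start(), last_start + 1, step):
            probes.append((start, sequence[start:start + probe_len]))
    return probes
```

With a toy sequence such as "ACGTACGTACnnnnGGGGCCCC", a probe length of 4 and a step of 2, no probe overlaps the masked "nnnn" stretch, because each unmasked region is tiled separately.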

The above method may further comprise screening the candidate probes according to at least one of several in silico parameters and/or properties. The method of the invention may also comprise screening the candidate probes according to at least one experimentally measurable parameter or property. The method may further comprise validating the candidate probes by target hybridization experiments.

In some embodiments, the method may further comprise identifying restriction cut sites within the target sequence, and selecting target sequences that exclude or are bounded by these restriction sites when generating candidate probes. Filtering out target sequences with restriction cut sites reduces the number of possible candidate probes prior to other components of in silico analysis and decreases the amount of computational time needed to evaluate the candidate probes.
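As a hedged illustration of selecting target sequences bounded by restriction sites, the sketch below splits a sequence at each occurrence of a recognition sequence and returns the stretches in between. The function name split_at_sites is hypothetical, and the sketch simply excludes the whole recognition sequence rather than modeling where an enzyme actually cuts within it.

```python
def split_at_sites(sequence, site):
    """Return (start, fragment) pairs for the stretches between restriction sites.

    Simplification: the entire recognition sequence is excluded from the
    fragments; the exact cut position inside the site is not modeled, and
    overlapping occurrences of the site are not considered.
    """
    fragments = []
    pos = 0
    while True:
        hit = sequence.find(site, pos)
        end = len(sequence) if hit == -1 else hit
        if end > pos:  # skip empty fragments between adjacent sites
            fragments.append((pos, sequence[pos:end]))
        if hit == -1:
            break
        pos = hit + len(site)
    return fragments
```

For example, with the EcoRI recognition sequence GAATTC, each returned fragment is free of the site and could then be passed on to repeat-masking and tiling.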

In other embodiments, the screening according to the in silico parameters comprises annotating the candidate probes for expression and association with the genes of interest. The screening may also comprise analyzing the candidate probes for target specificity and/or thermodynamically annotating the candidate probes. In yet another embodiment, the in silico parameters may comprise a parameter for kinetic properties of the candidate probes.

Methods which comprise in silico annotation may include annotating the candidate probes for their thermodynamic properties, such as duplex melting temperature and/or hairpin stability of the candidate probes. Where the methods of the invention comprise a parameter for duplex melting temperature, the duplex melting temperature may be estimated by the GC-content of the candidate probes. An accurate determination of the melting temperature of oligonucleotide hybridization is achieved by use of a model that considers nearest-neighbor interactions as represented by the nearest-neighbor parameters.
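A rough GC-content-based estimate of duplex melting temperature can be sketched as follows. The formula used here is a commonly quoted empirical approximation for DNA duplexes and is an assumption of this example, not the calculation specified above; the function names and the default salt concentration are likewise illustrative. A nearest-neighbor model would be expected to give more accurate values.

```python
import math

def gc_fraction(probe):
    """Fraction of G and C bases in the probe (case-insensitive)."""
    s = probe.upper()
    return (s.count("G") + s.count("C")) / len(s)

def tm_from_gc(probe, na_molar=1.0):
    """Approximate duplex Tm in degrees C from GC content and length.

    Assumed empirical formula (not taken from the text above):
        Tm = 81.5 + 16.6*log10([Na+]) + 0.41*(%GC) - 675/N
    """
    n = len(probe)
    percent_gc = 100.0 * gc_fraction(probe)
    return 81.5 + 16.6 * math.log10(na_molar) + 0.41 * percent_gc - 675.0 / n
```

Because the estimate depends only on GC content, length and salt, two probes with the same composition but different sequence order receive the same value, which is one reason the nearest-neighbor approach is preferred for accuracy.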

In other embodiments of the invention, the in silico parameters may comprise a parameter for duplex stability for the candidate probes. In some methods, the duplex stability parameter evaluates the candidate probes for a property selected from the group consisting of melting temperature, entropy, enthalpy and Gibbs free energy. In other embodiments, the hairpin structural stability parameter for the probe, and the target stability parameters, may be determined by evaluating the candidate probes for a property selected from the group consisting of melting temperature, entropy, enthalpy and Gibbs free energy. Alternatively, the in silico parameters may be target specificity and/or target secondary structural stability, where target structural stability is evaluated by a property selected from the group consisting of melting temperature, entropy, enthalpy and Gibbs free energy.

In other methods of the invention, the in silico parameters may comprise a parameter that is the maximum subsequence melting temperature of the probe. This is the maximum duplex melting temperature for any contiguous subsequence of a probe with its complementary target, where all possible subsequences of length L are considered and where L is less than the probe length. This metric has been found to be informative in filtering out probes that have GC-rich regions that appear to act as nucleation sites for non-specific hybridization. For the probes we designed, with nominal lengths of 60 bp, the values of L of interest spanned from 15-30 bp.
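The maximum subsequence melting temperature amounts to a sliding-window maximum, sketched below. The melting temperature function tm_proxy is a stand-in based on an assumed GC-content approximation; any other estimator, such as a nearest-neighbor calculation, could be passed in instead. All names are hypothetical.

```python
import math

def tm_proxy(seq, na_molar=1.0):
    """Stand-in Tm estimate from GC content (an assumed approximation)."""
    s = seq.upper()
    gc = (s.count("G") + s.count("C")) / len(s)
    return 81.5 + 16.6 * math.log10(na_molar) + 41.0 * gc - 675.0 / len(s)

def max_subsequence_tm(probe, window, tm_func=tm_proxy):
    """Return (max_tm, start) over all contiguous subsequences of length `window`.

    `window` plays the role of L and must be shorter than the probe.
    Ties are resolved in favor of the leftmost window.
    """
    if not 0 < window < len(probe):
        raise ValueError("window must be positive and shorter than the probe")
    best_tm, best_start = None, None
    for start in range(len(probe) - window + 1):
        tm = tm_func(probe[start:start + window])
        if best_tm is None or tm > best_tm:
            best_tm, best_start = tm, start
    return best_tm, best_start
```

A probe whose best window scores far above the rest of its sequence would be flagged as carrying a GC-rich stretch of the kind described above.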

In other methods of the invention, the in silico parameters may comprise a parameter for intergenicity of the candidate probes. When the parameter for intergenicity is utilized, this parameter evaluates whether the candidate probe sequence is within a gene, between genes, or within a coding region of a gene.

In other methods, the in silico parameters comprise a parameter for expression of the candidate probes. When the parameter for expression is utilized, this parameter evaluates whether the candidate probe sequence is within a gene, the candidate probe sequence is within an expression region of a gene, or the candidate probe sequence is within a coding region of a gene.

Additionally, the in silico parameters may include the specificity of the probe to its intended target. The method of the invention may also comprise determining a homology score expressed as an effective signal-to-background for each candidate probe in silico. The homology signal-to-background score for each candidate probe may be expressed in the form of HomLogS2B.

In other embodiments, the method comprises applying a pairwise probe selection process to the candidate probes. Applying pairwise selection comprises analyzing neighboring probe sequences within a genomic region of interest, evaluating the pair of neighboring probe sequences for a probe property and then scoring the neighboring probe sequences for the probe property, or properties, of interest. The pairwise filtering algorithm is a means of reducing the size of a set of candidate probes to a smaller set of probes, while enriching for a specific beneficial property or properties. In the methods where the pairwise analysis is utilized, the probe property may be selected from the group consisting of duplex melting temperature, hairpin stability, GC content, whether the probe is within an exon, whether the probe is within a gene, whether the probe is within an intron and whether the probe is within an intergenic region, or any property or score for combined properties of the probe or the gene in which it is contained.
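One way to picture pairwise filtering is the following sketch, which repeatedly compares neighboring probes and keeps the better-scoring member of each pair until the set is small enough. It is only an illustration under stated assumptions: probes are dictionaries already sorted by genomic position, the scoring function is supplied by the caller, and the pairing scheme (non-overlapping neighbors per pass) is a guess at one workable arrangement rather than the procedure of the invention.

```python
def pairwise_filter(probes, score, target_count):
    """Reduce a position-sorted probe list by keeping the better of each neighbor pair.

    Each pass roughly halves the set, so the result may fall below
    target_count; a real implementation would likely stop more precisely.
    """
    current = list(probes)
    while len(current) > target_count and len(current) > 1:
        kept = []
        for i in range(0, len(current) - 1, 2):
            a, b = current[i], current[i + 1]
            # Keep whichever neighbor scores higher; ties favor the first.
            kept.append(a if score(a) >= score(b) else b)
        if len(current) % 2:
            kept.append(current[-1])  # unpaired last probe is carried over
        current = kept
    return current
```

A scoring function such as `lambda p: -abs(p["tm"] - 80.0)` would enrich the surviving set for probes whose duplex melting temperature is near 80° C.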

In other embodiments, the method may comprise applying a biased pairwise probe filtering analysis to the candidate probes. Applying a biased pairwise selection algorithm comprises analyzing neighboring probe sequences within a genomic region of interest, evaluating the neighboring probe sequences for a first probe property or group of properties, evaluating the neighboring probe sequences for a second probe property or group of properties, and scoring the neighboring probe sequences for the first probe property while weighting this scoring process by the presence or absence of the second probe property. When biased pairwise analysis is utilized, the probe properties of the first and second parameters are selected from the group consisting of duplex melting temperature, hairpin stability, GC content, probe is within an exon, probe is within a gene, probe is within an intron and probe is within an intergenic region, as well as any second property or score for combined properties of the probe or the gene in which it is contained. Alternatively, the pairwise filtering selection algorithm may utilize a single score which combines multiple properties into a single value for each probe.
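The weighting idea behind biased pairwise selection can be sketched as follows, using closeness to a target melting temperature as the first property and exon membership as the second. The score form, the weight value and the field names tm and in_exon are all assumptions made for illustration.

```python
def biased_score(probe, target_tm=80.0, exon_weight=2.0):
    """Score the first property (closeness to target Tm), weighted by the second.

    The base score lies in (0, 1]; it is multiplied by exon_weight when the
    probe lies within an exon (hypothetical 'in_exon' flag).
    """
    base = 1.0 / (1.0 + abs(probe["tm"] - target_tm))
    return base * (exon_weight if probe["in_exon"] else 1.0)

def pick_from_pair(a, b, **kwargs):
    """Return the better-scoring probe of a neighboring pair (ties favor a)."""
    return a if biased_score(a, **kwargs) >= biased_score(b, **kwargs) else b
```

With a weight of 2.0, an exonic probe 0.5° C. off target outscores a non-exonic probe exactly on target; with a weight of 1.0 the bias disappears and the on-target probe is chosen.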

Alternatively, applying pairwise selection analysis may comprise selecting a plurality of probe pairs, each probe pair comprising a first probe sequence and a second probe sequence which are adjacent probe sequences within the chromosome of interest, evaluating the first and second probe sequences for at least one probe property, assigning at least one score for each probe property to the first and second probe sequences, and determining which probe sequence of each probe pair comprises the optimum probe characteristics for said microarray. In some embodiments the probe pairs are randomly selected for pairwise analysis, while in other embodiments the probe pairs are selected for pairwise analysis by the order in which they target the chromosome or gene sequence of interest. The order may be assigned in the 3′ to 5′ direction or the 5′ to 3′ direction. In a preferred embodiment that leads to the construction of more uniformly spaced probe sets, the probe pairs are ordered by the base pair gap size between the first and second probe sequences. The pairs may be ordered either from smallest gap distance to largest or from largest gap distance to smallest. The probe properties selected for pairwise analysis may be selected from the group consisting of duplex melting temperature, hairpin stability, GC content, probe is within an exon, probe is within a gene, probe is within an intron and probe is within an intergenic region.
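Ordering adjacent probe pairs by gap size might look like the sketch below. The field names start and length are hypothetical, and the gap is taken as the number of bases between the end of the first probe and the start of the second.

```python
def pairs_ordered_by_gap(probes, smallest_first=True):
    """Return (gap, first, second) tuples for adjacent probes, ordered by gap size.

    Probes are dictionaries with hypothetical 'start' and 'length' fields.
    A negative gap indicates overlapping probes.
    """
    ordered = sorted(probes, key=lambda p: p["start"])
    pairs = []
    for first, second in zip(ordered, ordered[1:]):
        gap = second["start"] - (first["start"] + first["length"])
        pairs.append((gap, first, second))
    pairs.sort(key=lambda item: item[0], reverse=not smallest_first)
    return pairs
```

Processing the smallest gaps first means the most crowded neighbors are thinned before sparse ones, which is one plausible reason such an ordering would yield more uniformly spaced probe sets.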

In certain embodiments of the methods of the invention in which candidate probes are screened according to at least one experimentally measurable parameter or property, the experimentally measurable property or parameter is selected from the group consisting of signal intensity, reproducibility of signal intensity, dye bias, susceptibility to non-specific binding, wash stability and persistence of probe hybridization. In embodiments where experimentally validating candidate probe performance is used for probe selection, validating the candidate probes comprises hybridizing the candidate probes to a plurality of target sets, evaluating the candidate probes for a probe property for each target set, and comparing the values for the probe property of each candidate probe across the plurality of target sets.

In most embodiments of the methods of the invention, a computer readable medium carrying one or more sequences of instructions for identifying and selecting nucleic acid probes for detecting a target with a probe array is needed, where the execution of the one or more sequences of instructions by one or more processors causes the one or more processors to perform the steps of repeat-masking said target sequences to form non-repeat masked regions, and tiling sequences across said non-repeat masked regions to generate said candidate probes. In certain embodiments, the steps performed by the one or more processors further comprise identifying restriction cut sites in a chromosome of interest, and selecting target sequences that exclude the restriction sites.

The microarrays of the invention comprise a solid support and a plurality of polynucleotide probes attached to the support, the plurality of polynucleotide probes having a corresponding plurality of different nucleotide sequences, and at least 50% of the polynucleotide probes have a duplex Tm within a temperature range of about 75° C. to about 85° C. In most embodiments the microarrays of the invention have at least 1,000 polynucleotide probes surface bound to the support, more likely at least 2,000 polynucleotide probes, and usually at least 10,000 polynucleotide probes. In most embodiments at least 20,000 polynucleotide probes are surface bound, and usually at least 40,000 polynucleotide probes are bound to the solid support. In some embodiments, over 100,000 probes are bound to the solid support and in some embodiments the number of probes on the microarray is larger than 400,000.

In one embodiment, at least 80% of said polynucleotide probes have a duplex Tm within a temperature range of about 75° C. to about 85° C., usually about 77° C. to about 83° C., more usually from about 78° C. to about 82° C. and even more usually about 79° C. to about 82° C. Alternatively, at least 90% of said polynucleotide probes on the microarray have a duplex Tm within said temperature range of about 75° C. to about 85° C., usually about 77° C. to about 83° C., more usually from about 78° C. to about 82° C. and even more usually about 79° C. to about 82° C. The determination of Tm values for probes is dependent on many factors and may vary widely depending on the method or calculation utilized for calculating Tm values. Thus it is useful to describe the invention in terms of a delta Tm value or range in which a large portion of the polynucleotide probes have a Tm value that falls within this delta Tm.

In other embodiments, the microarray comprises a solid support and a plurality of polynucleotide probes attached to the support, the plurality of polynucleotide probes having a corresponding plurality of different nucleotide sequences, and at least 50% of the polynucleotide probes have a duplex Tm within a delta Tm of less than 4° C., usually less than 3° C., more usually 2° C. Alternatively, at least 50% of the polynucleotide probes have a duplex Tm within a delta Tm of less than 1.5° C., usually less than 1.0° C., and more usually 0.5° C.

In other embodiments, at least 80% of the polynucleotide probes have a duplex Tm within a delta Tm of less than 4° C., usually less than 3° C., more usually 2° C. Alternatively, at least 80% of the polynucleotide probes have a duplex Tm within a delta Tm of less than 1.5° C. In yet other embodiments, at least 90% of the polynucleotide probes have a duplex Tm within a delta Tm of less than 4° C., usually less than 3° C., more usually 2° C.
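The delta Tm criterion can be checked for a given probe set by finding the largest fraction of probes whose Tm values fit inside any window of the stated width. The sketch below assumes the Tm values have already been computed by some consistent method; the function name and the inclusive treatment of the window edges are illustrative choices.

```python
def best_fraction_within_delta(tms, delta):
    """Largest fraction of Tm values that fit in any window of width `delta`.

    Uses a sliding window over the sorted values; the window is inclusive,
    so values exactly `delta` apart count as falling within it.
    """
    values = sorted(tms)
    best = 0
    lo = 0
    for hi in range(len(values)):
        # Shrink the window from the left until it is no wider than delta.
        while values[hi] - values[lo] > delta:
            lo += 1
        best = max(best, hi - lo + 1)
    return best / len(values)
```

Because only differences between Tm values are used, the result is insensitive to a uniform offset between Tm calculation methods, which is the motivation given above for describing arrays in terms of delta Tm.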

The probes on the microarray, in certain embodiments, have a nucleotide length in the range of at least 30 nucleotides to 100 nucleotides. In other embodiments, at least 50% of the polynucleotide probes on the solid support have the same nucleotide length, and that length may be about 60 nucleotides.

In some embodiments, at least 5% of the polynucleotide probes on the solid support hybridize to regulatory regions of a nucleotide sample of interest, while other embodiments may have at least 30% of the polynucleotide probes on the solid support hybridize to exonic regions of a nucleotide sample of interest. In yet other embodiments, at least 50% of the polynucleotide probes on the solid support hybridize to intergenic regions of a nucleotide sample of interest.

In certain embodiments the polynucleotide probes are structured and configured for analysis of a nucleotide sample by comparative genome hybridization and/or location analysis. In some embodiments, the nucleotide sequences of the polynucleotide probes hybridize to nucleotide samples generated from the human genome, while other microarrays comprise polynucleotide probes with nucleotide sequences which hybridize to nucleotide samples from the mouse genome.

In other embodiments, a microarray comprises a solid support and a plurality of polynucleotide probes attached to the support, the plurality of polynucleotide probes having a corresponding plurality of different nucleotide sequences, where about 60% to about 90% of the polynucleotide probes have a percent GC content within a delta percent GC of 10%. Alternatively, about 70% to about 90% of said polynucleotide probes have a percent GC content within a delta percent GC of 10%, and usually about 80% to about 90% of said polynucleotide probes have a percent GC content within a delta percent GC of 10%. In other embodiments, about 40% to about 90% of said polynucleotide probes have a percent GC content within a delta percent GC of 5%, usually about 60% to about 90% of said polynucleotide probes have a percent GC content within a delta percent GC of 5%, and more usually about 70% to about 90% of said polynucleotide probes have a percent GC content within a delta percent GC of 5%.

In other embodiments, about 35% to about 80% of the polynucleotide probes have a percent GC content within a delta percent GC of 3%, usually about 40% to about 70% of said polynucleotide probes have a percent GC content within a delta percent GC of 3%, and more usually about 40% to about 65% of said polynucleotide probes have a percent GC content within a delta percent GC of 3%.

In yet other embodiments, at least 70%, usually at least 75%, more usually at least 80%, and even more usually at least 85% of the polynucleotide probes have a percent GC content within a delta percent GC of 10%. In other embodiments, at least 60%, and usually at least 70%, of the polynucleotide probes have a percent GC content within a delta percent GC of 5%. In another embodiment, at least 40%, usually at least 50%, and more usually at least 60% of the polynucleotide probes have a percent GC content within a delta percent GC of 3%.

The present invention also provides a computer readable medium carrying one or more sequences of instructions for identifying and selecting nucleic acid probes for detecting a target with a probe array, wherein execution of the one or more sequences of instructions by one or more processors causes the one or more processors to perform the steps of identifying restriction cut sites in a chromosome of interest, selecting target sequences that exclude said restriction sites, repeat-masking the target sequences to form non-repeat masked regions, and tiling sequences across the non-repeat masked regions to generate the candidate probes.

These and other advantages and features of the invention will become apparent to those persons skilled in the art upon reading the details of the methods for probe selection and microarray composition useful for CGH and location analysis as more fully described below.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a flow chart of a process of probe selection utilizing probe selection parameters for CGH in accordance with the invention.

FIG. 2 is a flow chart of a process for filtering and reduction of target sequences in accordance with the invention.

FIG. 3 is a flow chart of a process for analyzing candidate probes in silico for selecting probes for CGH arrays in accordance with the invention.

FIG. 4 is a flow chart of a process for analyzing candidate probes utilizing empirically determined probe indicators for selecting probes for CGH arrays in accordance with the invention.

FIG. 5 is a flow chart of a process for validating candidate probes by target hybridization experiments for selecting probes for CGH arrays in accordance with the invention.

FIG. 6 is a flow chart of a pairwise process for analyzing candidate probes for CGH arrays in accordance with the invention.

FIG. 7 is a histogram of all candidate probes for chromosome 16 to be subsequently analyzed using the pairwise probe filtering process in accordance with the invention.

FIG. 8 is a histogram of filtered probe distributions demonstrating the probe filtering achieved by a pairwise analysis in accordance with the invention.

FIG. 9 is a graph of probes plotted along a small region of chromosome 16 where a subset of the probes plotted were selected by a biased pairwise analysis in accordance with the invention.

FIG. 10 is a block diagram of a microarray comprising probes with selected duplex Tm properties in accordance with the invention.

FIG. 11 is a graph of the duplex melting temperatures of probes of an expression array compared to the duplex melting temperatures of probes on a microarray in accordance with the present invention.

FIG. 12 is a graph of the fraction of probes within a differential duplex melting temperature range for probes on a gene expression array compared to probes on a microarray in accordance with the invention.

FIG. 13 is a graph of the % GC content of probes of an expression array, compared to the % GC content of probes on a microarray in accordance with the present invention.

FIG. 14 is a graph of the fraction of probes within a differential/delta % GC content for probes on a gene expression array compared to probes on a microarray in accordance with the invention.

FIG. 15 is a block diagram illustrating an example of a generic computer system which may be used in implementing the present invention.

FIGS. 16 a, b, c are histograms showing the relative signal strength of candidate probes of various lengths in accordance with the invention.

FIG. 17 a is a histogram showing the responsiveness of candidate probes to duplex Tm in accordance with the invention.

FIG. 17 b is a histogram showing the responsiveness of candidate probes to binding persistence analysis in accordance with the invention.

FIG. 17 c is a histogram showing the responsiveness of candidate probes to HomoLogS2B analysis in accordance with the invention.

DETAILED DESCRIPTION OF THE INVENTION

Before the present methods for CGH probe selection are described, it is to be understood that this invention is not limited to particular genes or chromosomes described, as such may, of course, vary. It is also to be understood that the terminology used herein is for the purpose of describing particular embodiments only, and is not intended to be limiting, since the scope of the present invention will be limited only by the appended claims.

Where a range of values is provided, it is understood that each intervening value, to the tenth of the unit of the lower limit unless the context clearly dictates otherwise, between the upper and lower limits of that range is also specifically disclosed. Each smaller range between any stated value or intervening value in a stated range and any other stated or intervening value in that stated range is encompassed within the invention. The upper and lower limits of these smaller ranges may independently be included or excluded in the range, and each range where either, neither or both limits are included in the smaller ranges is also encompassed within the invention, subject to any specifically excluded limit in the stated range. Where the stated range includes one or both of the limits, ranges excluding either or both of those included limits are also included in the invention.

Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. Although any methods and materials similar or equivalent to those described herein can be used in the practice or testing of the present invention, the preferred methods and materials are now described. All publications mentioned herein are incorporated herein by reference to disclose and describe the methods and/or materials in connection with which the publications are cited.

It must be noted that as used herein and in the appended claims, the singular forms “a”, “and”, and “the” include plural referents unless the context clearly dictates otherwise. Thus, for example, reference to “a polynucleotide” includes a plurality of such polynucleotides and reference to “the target fragment” includes reference to one or more target fragments and equivalents thereof known to those skilled in the art, and so forth.

The publications discussed herein are provided solely for their disclosure prior to the filing date of the present application. Nothing herein is to be construed as an admission that the present invention is not entitled to antedate such publication by virtue of prior invention. Further, the dates of publication provided may be different from the actual publication dates which may need to be independently confirmed.

Definitions

The terms “nucleic acid” and “polynucleotide” are used interchangeably herein to describe a polymer of any length, e.g., greater than about 10 bases, greater than about 100 bases, greater than about 500 bases, greater than 1000 bases, usually up to about 10,000 or more bases composed of nucleotides, e.g., deoxyribonucleotides or ribonucleotides, or compounds produced synthetically (e.g., PNA as described in U.S. Pat. No. 5,948,902 and the references cited therein) which can hybridize with naturally occurring nucleic acids in a sequence specific manner analogous to that of two naturally occurring nucleic acids, e.g., can participate in Watson-Crick base pairing interactions.

The terms “ribonucleic acid” and “RNA” as used herein mean a polymer composed of ribonucleotides.

The terms “deoxyribonucleic acid” and “DNA” as used herein mean a polymer composed of deoxyribonucleotides.

The term “oligonucleotide” as used herein denotes single stranded nucleotide multimers of from about 10 to 100 nucleotides and up to 200 nucleotides in length. Oligonucleotides are usually synthetic and, in many embodiments, are under 50 nucleotides in length.

The term “oligomer” is used herein to indicate a chemical entity that contains a plurality of monomers. As used herein, the terms “oligomer” and “polymer” are used interchangeably, as it is generally, although not necessarily, smaller “polymers” that are prepared using the functionalized substrates of the invention, particularly in conjunction with combinatorial chemistry techniques. Examples of oligomers and polymers include polydeoxyribonucleotides (DNA), polyribonucleotides (RNA), other nucleic acids that are C-glycosides of a purine or pyrimidine base, polypeptides (proteins), polysaccharides (starches, or polysugars), and other chemical entities that contain repeating units of like chemical structure.

The term “sample” as used herein relates to a material or mixture of materials, typically, although not necessarily, in fluid form, containing one or more components of interest.

The terms “nucleoside” and “nucleotide” are intended to include those moieties that contain not only the known purine and pyrimidine bases, but also other heterocyclic bases that have been modified. Such modifications include methylated purines or pyrimidines, acylated purines or pyrimidines, alkylated riboses or other heterocycles. In addition, the terms “nucleoside” and “nucleotide” include those moieties that contain not only conventional ribose and deoxyribose sugars, but other sugars as well. Modified nucleosides or nucleotides also include modifications on the sugar moiety, e.g., wherein one or more of the hydroxyl groups are replaced with halogen atoms or aliphatic groups, or are functionalized as ethers, amines, or the like.

The phrase “surface-bound polynucleotide” refers to a polynucleotide that is immobilized on a surface of a solid substrate, where the substrate can have a variety of configurations, e.g., a sheet, bead, or other structure. In certain embodiments, the collections of oligonucleotide probe elements employed herein are present on a surface of the same planar support, e.g., in the form of an array.

The phrase “labeled population of nucleic acids” refers to a mixture of nucleic acids that are detectably labeled, e.g., fluorescently labeled, such that the presence of the nucleic acids can be detected by assessing the presence of the label. Where a labeled population of nucleic acids is “made from” a chromosome sample, the chromosome sample is usually employed as a template for making the population of nucleic acids.

The term “array” encompasses the term “microarray” and refers to an ordered array presented for binding to nucleic acids and the like.

An “array” includes any two-dimensional or substantially two-dimensional (as well as a three-dimensional) arrangement of spatially addressable regions bearing nucleic acids, particularly oligonucleotides or synthetic mimetics thereof, and the like. Where the arrays are arrays of nucleic acids, the nucleic acids may be adsorbed, physisorbed, chemisorbed, or covalently attached to the arrays at any point or points along the nucleic acid chain.

Any given substrate may carry one, two, four or more arrays disposed on a front surface of the substrate. Depending upon the use, any or all of the arrays may be the same or different from one another and each may contain multiple spots or features. A typical array may contain one or more, including more than two, more than ten, more than one hundred, more than one thousand, more than ten thousand features, or even more than one hundred thousand features, in an area of less than 20 cm² or even less than 10 cm², e.g., less than about 5 cm², including less than about 1 cm², less than about 1 mm², e.g., 100 μm², or even smaller. For example, features may have widths (that is, diameter, for a round spot) in the range from 10 μm to 1.0 cm. In other embodiments each feature may have a width in the range of 1.0 μm to 1.0 mm, usually 5.0 μm to 500 μm, and more usually 10 μm to 200 μm. Non-round features may have area ranges equivalent to that of circular features with the foregoing width (diameter) ranges. At least some, or all, of the features are of different compositions (for example, when any repeats of each feature composition are excluded the remaining features may account for at least 5%, 10%, 20%, 50%, 95%, 99% or 100% of the total number of features). Inter-feature areas will typically (but not essentially) be present which do not carry any nucleic acids (or other biopolymer or chemical moiety of a type of which the features are composed). Such inter-feature areas typically will be present where the arrays are formed by processes involving drop deposition of reagents but may not be present when, for example, photolithographic array fabrication processes are used. It will be appreciated though, that the inter-feature areas, when present, could be of various sizes and configurations.

Each array may cover an area of less than 200 cm², or even less than 50 cm², 5 cm², 1 cm², 0.5 cm², or 0.1 cm². In certain embodiments, the substrate carrying the one or more arrays will be shaped generally as a rectangular solid (although other shapes are possible), having a length of more than 4 mm and less than 150 mm, usually more than 4 mm and less than 80 mm, more usually less than 20 mm; a width of more than 4 mm and less than 150 mm, usually less than 80 mm and more usually less than 20 mm; and a thickness of more than 0.01 mm and less than 5.0 mm, usually more than 0.1 mm and less than 2 mm and more usually more than 0.2 mm and less than 1.5 mm, such as more than about 0.8 mm and less than about 1.2 mm. With arrays that are read by detecting fluorescence, the substrate may be of a material that emits low fluorescence upon illumination with the excitation light. Additionally in this situation, the substrate may be relatively transparent to reduce the absorption of the incident illuminating laser light and subsequent heating if the focused laser beam travels too slowly over a region. For example, the substrate may transmit at least 20%, or 50% (or even at least 70%, 90%, or 95%), of the illuminating light incident on the front as may be measured across the entire integrated spectrum of such illuminating light or alternatively at 532 nm or 633 nm.

Arrays can be fabricated using drop deposition from pulse-jets of either nucleic acid precursor units (such as monomers) in the case of in situ fabrication, or the previously obtained nucleic acid. Such methods are described in detail in, for example, the previously cited references including U.S. Pat. No. 6,242,266, U.S. Pat. No. 6,232,072, U.S. Pat. No. 6,180,351, U.S. Pat. No. 6,171,797, U.S. Pat. No. 6,323,043, U.S. patent application Ser. No. 09/302,898 filed Apr. 30, 1999 by Caren et al., and the references cited therein. As already mentioned, these references are incorporated herein by reference. Other drop deposition methods can be used for fabrication, as previously described herein. Also, instead of drop deposition methods, photolithographic array fabrication methods may be used. Inter-feature areas need not be present particularly when the arrays are made by photolithographic methods as described in those patents.

An array is “addressable” when it has multiple regions of different moieties (e.g., different oligonucleotide sequences) such that a region (i.e., a “feature” or “spot” of the array) at a particular predetermined location (i.e., an “address”) on the array will detect a particular sequence. Array features are typically, but need not be, separated by intervening spaces. In the case of an array in the context of the present application, the “population of labeled nucleic acids” will be referenced as a moiety in a mobile phase (typically fluid), to be detected by “surface-bound polynucleotides” which are bound to the substrate at the various regions. These phrases are synonymous with the arbitrary terms “target” and “probe”, or “probe” and “target”, respectively, as they are used in other publications.

A “scan region” refers to a contiguous (preferably, rectangular) area in which the array spots or features of interest, as defined above, are found or detected. Where fluorescent labels are employed, the scan region is that portion of the total area illuminated from which the resulting fluorescence is detected and recorded. Where other detection protocols are employed, the scan region is that portion of the total area queried from which resulting signal is detected and recorded. For the purposes of this invention and with respect to fluorescent detection embodiments, the scan region includes the entire area of the slide scanned in each pass of the lens, between the first feature of interest, and the last feature of interest, even if there exist intervening areas that lack features of interest.

An “array layout” refers to one or more characteristics of the features, such as feature positioning on the substrate, one or more feature dimensions, and an indication of a moiety at a given location. “Hybridizing” and “binding”, with respect to nucleic acids, are used interchangeably.

The term “stringent assay conditions” as used herein refers to conditions that are compatible to produce binding pairs of nucleic acids, e.g., probes and targets, of sufficient complementarity to provide for the desired level of specificity in the assay while being incompatible to the formation of binding pairs between binding members of insufficient complementarity to provide for the desired specificity. The term stringent assay conditions refers to the combination of hybridization and wash conditions.

“Stringent hybridization” and “stringent hybridization wash conditions” in the context of nucleic acid hybridization (e.g., as in array, Southern or Northern hybridizations) are sequence dependent, and are different under different environmental parameters. Stringent hybridization conditions that can be used to identify nucleic acids within the scope of the invention can include, e.g., hybridization in a buffer comprising 50% formamide, 5×SSC, and 1% SDS at 42° C., or hybridization in a buffer comprising 5×SSC and 1% SDS at 65° C., both with a wash of 0.2×SSC and 0.1% SDS at 65° C. Exemplary stringent hybridization conditions can also include a hybridization in a buffer of 40% formamide, 1 M NaCl, and 1% SDS at 37° C., and a wash in 1×SSC at 45° C. Alternatively, hybridization to filter-bound DNA in 0.5 M NaHPO₄, 7% sodium dodecyl sulfate (SDS), 1 mM EDTA at 65° C., and washing in 0.1×SSC/0.1% SDS at 68° C. can be employed. Yet additional stringent hybridization conditions include hybridization at 60° C. or higher and 3×SSC (450 mM sodium chloride/45 mM sodium citrate) or incubation at 42° C. in a solution containing 30% formamide, 1 M NaCl, 0.5% sodium sarcosine, 50 mM MES, pH 6.5. Those of ordinary skill will readily recognize that alternative but comparable hybridization and wash conditions can be utilized to provide conditions of similar stringency.

In certain embodiments, the stringency of the wash conditions determines whether a nucleic acid is specifically hybridized to a probe. Wash conditions used to identify nucleic acids may include, e.g.: a salt concentration of about 0.02 M at pH 7 and a temperature of about 20° C. to about 40° C.; or, a salt concentration of about 0.15 M NaCl at 72° C. for about 15 minutes; or, a salt concentration of about 0.2×SSC at a temperature of about 30° C. to about 50° C. for about 2 to about 20 minutes; or, the hybridization complex is washed twice with a solution with a salt concentration of about 2×SSC containing 1% SDS at room temperature for 15 minutes and then washed twice by 0.1×SSC containing 0.1% SDS at 37° C. for 15 minutes; or, equivalent conditions. Stringent conditions for washing can also be, e.g., 0.2×SSC/0.1% SDS at 42° C. See Sambrook, Ausubel, or Tijssen (cited below) for detailed descriptions of equivalent hybridization and wash conditions and for reagents and buffers, e.g., SSC buffers and equivalent reagents and conditions.

A specific example of stringent assay conditions is rotating hybridization at 65° C. in a salt based hybridization buffer with a total monovalent cation concentration of 1.5 M (e.g., as described in U.S. patent application Ser. No. 09/655,482 filed on Sep. 5, 2000, the disclosure of which is herein incorporated by reference) followed by washes of 0.5×SSC and 0.1×SSC at room temperature.

Stringent hybridization conditions may also include a “prehybridization” of aqueous phase nucleic acids with complexity-reducing nucleic acids to suppress repetitive sequences. For example, certain stringent hybridization conditions include, prior to any hybridization to surface-bound polynucleotides, hybridization with Cot-1 DNA, or the like.

Stringent assay conditions are hybridization conditions that are at least as stringent as the above representative conditions, where a given set of conditions is considered to be at least as stringent if substantially no additional binding complexes that lack sufficient complementarity to provide for the desired specificity are produced in the given set of conditions as compared to the above specific conditions, where by “substantially no additional” is meant less than about 5-fold more, typically less than about 3-fold more. Other stringent hybridization conditions are known in the art and may also be employed, as appropriate.

The term “mixture”, as used herein, refers to a combination of elements that are interspersed and not in any particular order. A mixture is heterogeneous and not spatially separable into its different constituents. Examples of mixtures of elements include a number of different elements that are dissolved in the same aqueous solution, or a number of different elements attached to a solid support at random or in no particular order in which the different elements are not especially distinct. In other words, a mixture is not addressable. To be specific, an array of surface-bound polynucleotides, as is commonly known in the art and described below, is not a mixture of capture agents because the species of surface-bound polynucleotides are spatially distinct and the array is addressable.

“Isolated” or “purified” generally refers to isolation of a substance (compound, polynucleotide, protein, polypeptide, chromosome, etc.) such that the substance comprises the majority percent of the sample in which it resides. Typically in a sample a substantially purified component comprises 50%, preferably 80%-85%, more preferably 90-95% of the sample. Techniques for purifying polynucleotides, polypeptides and intact chromosomes of interest are well-known in the art and include, for example, ion-exchange chromatography, affinity chromatography, sorting, and sedimentation according to density.

The terms “assessing” and “evaluating” are used interchangeably to refer to any form of measurement, and include determining if an element is present or not. The terms “determining,” “measuring,” “assessing,” and “assaying” are used interchangeably and include both quantitative and qualitative determinations. Assessing may be relative or absolute. “Assessing the presence of” includes determining the amount of something present, as well as determining whether it is present or absent.

The term “using” has its conventional meaning, and, as such, means employing, e.g., putting into service, a method or composition to attain an end. For example, if a program is used to create a file, a program is executed to make a file, the file usually being the output of the program. In another example, if a computer file is used, it is usually accessed, read, and the information stored in the file employed to attain an end. Similarly, if a unique identifier, e.g., a barcode, is used, the unique identifier is usually read to identify, for example, an object or file associated with the unique identifier.

If a surface-bound polynucleotide “corresponds to” a chromosome, the polynucleotide usually contains a sequence of nucleic acids that is unique to that chromosome. Accordingly, a surface-bound polynucleotide that corresponds to a particular chromosome usually specifically hybridizes to a labeled nucleic acid made from that chromosome, relative to labeled nucleic acids made from other chromosomes. Array features, because they usually contain surface-bound polynucleotides, can also correspond to a chromosome.

A “non-cellular chromosome composition”, as will be discussed in greater detail below, is a composition of chromosomes synthesized by mixing pre-determined amounts of individual chromosomes. These synthetic compositions can include selected concentrations and ratios of chromosomes that do not naturally occur in a cell, including any cell grown in tissue culture. Non-cellular chromosome compositions may contain more than an entire complement of chromosomes from a cell, and, as such, may include extra copies of one or more chromosomes from that cell. Non-cellular chromosome compositions may also contain less than the entire complement of chromosomes from a cell.

A “probe” means a polynucleotide which can specifically hybridize to a target nucleotide, either in solution or as a surface-bound polynucleotide.

The term “validated probe” means a probe that has passed at least one screening or filtering process in which experimental data related to the performance of the probes was used as part of the selection criteria.

“In silico” means those parameters that can be determined without the need to perform any experiments, by using information either calculated de novo or available from public or private databases.

The term “duplex Tm” refers to the melting temperature of two oligonucleotides which have formed a duplex structure.

The present invention provides alternative and novel microarrays, methods and systems for CGH and location analysis probe microarray selection that overcome the drawbacks of existing microarray probe selection techniques. The methods of the instant invention utilize probe/target hybridization experiments and/or unique data analysis techniques to identify and select nucleotide probe(s) that target polynucleotide fragments from a chromosome of interest. The methods for probe selection described herein will benefit from flexible microarray fabrication technologies that can rapidly customize array content as more information becomes available on which regions of particular chromosomes/genes are important for disease/cancer development as well as disease diagnostics.

The invention provides methods, systems and computer readable media for identifying and selecting nucleic acid probes for detecting a target with a nucleic acid probe array or microarray. The methods comprise, in general terms, the selection of genomic nucleotide ranges of interest, determining appropriate target sequences for CGH and/or location analysis, generating candidate probes specific for the target sequences and analyzing candidate probes for specific probe properties by computational and/or experimental processes to optimize probe selection and reduce the number of probes to a value appropriate for placement on a microarray. The invention also provides microarrays comprising probes selected by the methods of the invention. The microarrays comprise a solid support and a plurality of surface-bound probes, the surface-bound probes having very similar thermodynamic properties as well as similar GC content. More specifically, a large portion of the probes utilized in the microarrays of the invention have duplex melting temperatures (Tm) which are within a narrow temperature range compared to the Tm range of probes for other microarray systems, such as arrays for gene expression.

The invention is particularly useful with comparative genome hybridization microarrays, such as microarrays based on the human or mouse genome. The invention permits more cost-effective and efficient identification of gene regions or sections which can be associated with human disease, points of therapeutic intervention, and potential toxic side-effects of proposed therapeutic entities.

In general terms, the methods for probe selection and validation of the invention comprise identifying probe properties that can be determined a priori from the probe's sequence and the sequence of the genome it is contained within, and may further comprise expanding the set of properties from those that can be determined a priori to those that can be measured empirically through simple experiments, such as self-self experiments. The methods of the invention may further comprise measuring the response of candidate probes to a known stimulus, where the stimulus is generated by a set of samples in which the copy numbers for relatively small subsets of the genome are altered in known ways.

In designing an array comprising high-performance probes that comprehensively covers a whole genome (e.g. the human genome), the entire genomic sequence must be searched when generating specific candidate probes. This homology search is potentially the most time-consuming part of the probe design process. Ideally, a homology search would be the first part of the process; however, because of the scale of the human genome, executing an exhaustive search of all possible short oligo probes (<100 bases) can take computation time on the scale of a CPU year (based on ProbeSpec) for modern 3 GHz processors. This computation time can be reduced by any of a number of methods, most involving reducing the scale of the search. For example, known highly repetitive sequences can be removed by a process called RepeatMasking. Repeat-masked genomic sequences are publicly available on the web (e.g. UCSC's www.genomebrowser.org). Another approach is to reduce the number of probe sequences being searched up-front. This can be done on the basis of any known property of the probe, from thermodynamic properties, such as duplex-Tm and hairpin free energy, to position on the genome. The present invention provides methods which apply known probe information as a screening process to reduce the number of probe sequences to be analyzed in a homology search, thus reducing the computation time needed to identify appropriate probes for a CGH based array.
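By way of illustration only, the up-front screening described above may be sketched as follows. The sketch assumes that an inexpensive sequence-derived property is computed for each candidate before any homology search is attempted. The salt-adjusted GC-content formula used for the approximate duplex Tm is a common textbook estimate substituted here for simplicity; it is not stated in this disclosure, and the function names and the 85 to 95 degree window are hypothetical.

```python
import math

def gc_fraction(seq: str) -> float:
    """Fraction of G and C bases in the sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)

def approx_duplex_tm(seq: str, na_molar: float = 1.0) -> float:
    """Rough duplex Tm in degrees C from GC content, length and salt.

    Assumption: a generic GC-content approximation, not a
    nearest-neighbor thermodynamic model.
    """
    return (81.5 + 16.6 * math.log10(na_molar)
            + 41.0 * gc_fraction(seq) - 675.0 / len(seq))

def prescreen(probes, tm_min=85.0, tm_max=95.0):
    """Keep only probes whose approximate Tm lies in the window
    (hypothetical thresholds), so fewer sequences reach the costly
    homology search."""
    return [p for p in probes if tm_min <= approx_duplex_tm(p) <= tm_max]

candidates = ["ACGT" * 15, "AT" * 30]   # two 60-mers: 50% GC and 0% GC
survivors = prescreen(candidates)       # only the 50% GC probe remains
```

Only the probes returned by such a screen would then be submitted to the homology search, which is where the saving in computation time arises.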

The present systems, techniques, methods and computer readable media also provide for streamlined workflow, since researchers need only to prepare and process one microarray instead of two or more per sample, with fewer steps in processing and tracking required.

Further, greater reproducibility of results is provided for, since all data for an entire genome is generated from a single microarray, resulting in less variability in the data. When two or more microarrays associated with the same sample are processed separately, there are always questions of variability of the experimental conditions used to process each microarray.

Designing a microarray involves determining the amount of “real estate” (number of probes) that is available for the final array. The array designer also determines the amount of probes or “real estate” to use for specified regulatory regions and intergenic regions, as well as the amount of probes necessary to adequately cover introns and exons of the chromosomes of interest. Initially, a designer will generate 20 to 40 million candidate probes and will need to filter the probes for certain probe properties or parameters to obtain a final array with approximately 40,000 probes. Intermediate arrays are manufactured in some embodiments of the methods of the invention, which have a redundancy of 3 or 4 fold over the number of probes selected for the final array. These intermediate arrays are utilized to screen candidate probes for certain probe properties by direct or indirect experimentation.

In many embodiments, the oligonucleotides (i.e. probes) contained in the features of the invention have been designed according to one or more particular parameters to be suitable for use in a given application, where representative parameters include, but are not limited to: length, melting temperature (Tm), non-homology with other regions of the genome, hybridization signal intensities, kinetic properties under hybridization conditions, etc., see e.g., U.S. Pat. No. 6,251,588, the disclosure of which is herein incorporated by reference.

Standard hybridization techniques (using high stringency hybridization conditions) are used to probe the subject array. Suitable methods are described in references describing CGH techniques (Kallioniemi et al., Science 258:818-821 (1992) and WO 93/18186). Several guides to general techniques are available, e.g., Tijssen, Hybridization with Nucleic Acid Probes, Parts I and II (Elsevier, Amsterdam 1993). For descriptions of techniques suitable for in situ hybridizations see Galla et al., Meth. Enzymol., 21:470-480 (1981) and Angerer et al. in Genetic Engineering: Principles and Methods, Setlow and Hollaender, Eds., Vol 7, pgs 43-65 (Plenum Press, New York 1985). See also U.S. Pat. Nos. 6,335,167; 6,197,501; 5,830,645; and 5,665,549; the disclosures of which are herein incorporated by reference.

Referring now to FIG. 1, there is shown a flow chart of events that may be carried out in a nucleic acid probe selection method in accordance with the invention. At event 10, a nucleotide sample is selected for probe design for microarray analysis. The nucleotide sample may be a genome or genomic nucleotide range or ranges, such as a chromosome. At event 20, potential target sequences of the nucleotide sample of interest are identified, filtered and reduced to a set of appropriate target sequences for CGH and/or location analysis. The potential target sequences are filtered by size, number of repeat-masked bases and/or GC-content. Target sequences are also filtered and reduced in number by eliminating repetitive target sequences in event 20. Target sequences can also be filtered by eliminating potential target sequences which comprise a restriction enzyme cut site. By limiting the size of the set of target sequences, the computational time needed to generate and analyze the candidate probes is decreased. These and other processes involved in obtaining appropriate target sequences for CGH probe selection are more fully described in FIG. 2 below.

After determining a set of appropriate target sequences in event 20, candidate probes to the genomic sequence (e.g. chromosome) of interest are generated at event 30 as shown in FIG. 1. Generating a set of candidate probes comprises tiling probes across regions of the target sequences determined in event 20, which enables the candidate probes to be free of repeat-masked sections as well as restriction cut sites if desired. The generation of candidate probes may comprise additional filtering and reduction depending on the genomic sequence of interest.

At event 40, the candidate probes are filtered or reduced in total numbers by utilizing indicators or metrics of certain probe properties which assess candidate probe quality in silico. In silico means those parameters that can be determined without the need to perform any experiments, by using information either calculated de novo or available from public or private databases. Probe parameters utilized to annotate candidate probes may include but are not limited to target specificity, thermodynamic properties, expression and association with genes, homology and also kinetic properties. The annotation of candidate probes in silico, by these and other probe properties, is more fully described in FIG. 3 below. Candidate probes which do not meet the in silico parameters or indicators for a “good” probe are discarded from the probe selection process at event 42.

Candidate probes which are identified to have certain desirable probe properties in silico are subjected to a pairwise selection process to filter and reduce the number of potential probes at event 50. The pairwise filtering evaluates a pair of candidate probes for a probe property or set of properties and scores the probes within the pair against each other according to the probe property analyzed. FIG. 6 describes in more detail the process for pairwise filtering. Probes which do not pass the pairwise selection process are not selected and are discarded in event 52. Probes which pass pairwise filtering may require further filtering and can be evaluated experimentally for other desirable probe properties at event 60. In certain embodiments, selecting probes for a CGH or location analysis based array requires no further filtering or reduction of candidate probes beyond that applied by the pairwise and in silico analysis, as shown in event 54. As more indicators and metrics for probe performance are identified and adapted for analyzing probe performance in silico, less emphasis is placed on experimental probe results for CGH probe selection.
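The general idea of scoring the two members of a pair against each other may be sketched as below. This is a minimal illustration under stated assumptions and does not reproduce the procedure of FIG. 6: the pairing of neighboring candidates, the rule that the higher-scoring member survives, and the example score (closeness of GC content to 50%) are all hypothetical choices made for the sketch.

```python
def gc_fraction(seq: str) -> float:
    """Fraction of G and C bases in the sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)

def gc_balance_score(seq: str) -> float:
    """Hypothetical score: higher when GC content is closer to 50%."""
    return -abs(gc_fraction(seq) - 0.5)

def pairwise_filter(probes, score=gc_balance_score):
    """Compare candidates two at a time; the better-scoring member
    of each pair is retained and the other is discarded."""
    survivors = []
    for i in range(0, len(probes) - 1, 2):
        first, second = probes[i], probes[i + 1]
        survivors.append(first if score(first) >= score(second) else second)
    if len(probes) % 2:
        # A trailing candidate without a partner passes through unchanged.
        survivors.append(probes[-1])
    return survivors

candidates = ["GGGGCCCC", "ACGTACGT", "ATATATAT", "AACCGGTT", "GCGCATAT"]
selected = pairwise_filter(candidates)
```

Any other probe property, or a weighted set of properties, could be supplied through the `score` argument without changing the comparison logic.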

In the method shown in FIG. 1, candidate probes which pass the pairwise filtering may require further analysis by measuring specific “good” probe indicators/probe properties experimentally at event 60. To obtain a sense of a probe's performance, experiments are completed which measure properties of a probe that can, in the absence of more direct experiments, provide a good indication of whether a probe will be suitable for a CGH or location analysis array. Such experimentally measurable properties useful in determining a candidate probe's performance include but are not limited to: raw signal intensity, reproducibility of signal intensity, dye bias, and susceptibility to non-specific binding. These empirically determined probe indicators, as well as others, are described more fully in FIG. 4 below.

Candidate probes which do not meet the experimentally measurable probe parameters are discarded/unselected in event 70, while the remaining candidate probes which meet the probe parameter standards in event 60 may be utilized for CGH arrays, event 72, or be subjected to further filtering by completing probe validation experiments at event 80. The order in which experimentally measurable probe parameters are applied to candidate probes may vary depending on the genomic sequence of interest.

At event 80, candidate probes are placed on an array and subjected to target sets/samples comprising known target sequences with known copy numbers. The probes are evaluated and scored by assessing a plurality of probe properties over numerous target sets. The details of the probe properties and the methods utilized in probe validation experiments are described in more detail in FIG. 5 below.

The candidate probes are evaluated in event 80 for adequate signal response as well as reproducibility across target sets. The candidate probes which obtain a high validation score from the validation experiments are suitable for use on a CGH array, event 72, while candidate probes with deficient or poor validation scores are not selected in event 90.

Depending on the space available on the array chip, more or fewer probe parameters can be implemented and/or the thresholds and cut-offs of probe parameters may be adjusted as needed. Candidate probes may be prioritized, for example gene-by-gene, region-by-region, or strictly filtered on validation scores. Annotation of probes for position, gene association and expression may also be utilized to finalize the probe selection for a CGH or location array.

Referring now to FIG. 2, there is shown a flow chart of one embodiment of the invention, showing the events for identifying and filtering potential target sequences. At event 100, the genome range (e.g. a chromosome) of interest is analyzed by a process which identifies potential cut sites for various restriction digest enzymes. Many scientists utilize restriction enzymes in CGH experiments to produce polynucleotide fragments from target samples, thus in some embodiments it is preferable to eliminate probes with restriction enzyme sites from the probe selection process. While event 100 is optional in certain embodiments of the methods of the invention, it can be a useful filtering tool which decreases the computational time needed to analyze candidate probes, as well as a feature desired by some scientists. Exemplary restriction enzymes which can be utilized in a CGH protocol are RsaI and AluI. Screening the chromosome of interest for restriction sites allows for the filtering and reduction of potential target fragments which occurs in event 110. Target sequences which include a restriction cut site are eliminated from the set of potential target fragments which are the basis for generating CGH probes. Reducing the number of potential target fragments at an early event in the methods for CGH probe selection enables less computational power and time to be needed in identifying optimal CGH probes.
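The screening for restriction sites in event 100 may, for example, be sketched as an in silico digest. The sketch below assumes the standard recognition sequences GTAC for RsaI and AGCT for AluI, each cut bluntly between the second and third base; these enzyme details, the function names and the use of half-open coordinates are assumptions of the sketch and are not recited in this disclosure. Splitting the sequence at every cut site is one possible way of arriving at fragments that contain no internal cut site.

```python
import re

# Assumed recognition site and cut offset for each enzyme. Both sites are
# palindromic, so scanning one strand is sufficient.
SITES = {"RsaI": ("GTAC", 2), "AluI": ("AGCT", 2)}

def cut_positions(seq, enzymes=("RsaI", "AluI")):
    """Return sorted 0-based positions at which the sequence is cut."""
    seq = seq.upper()
    cuts = set()
    for name in enzymes:
        site, offset = SITES[name]
        # A lookahead match also finds overlapping occurrences of the site.
        for match in re.finditer(f"(?={site})", seq):
            cuts.add(match.start() + offset)
    return sorted(cuts)

def digest(seq, enzymes=("RsaI", "AluI")):
    """Return (start, end) half-open intervals of the digest fragments."""
    bounds = [0] + cut_positions(seq, enzymes) + [len(seq)]
    return [(s, e) for s, e in zip(bounds, bounds[1:]) if e > s]

example = "AAAGTACCCCAGCTTT"
fragments = digest(example)
```

For the short example sequence the two cut positions are 5 and 12, which gives three fragments; on a chromosome-scale sequence the same intervals would serve as the potential target fragments passed to the later filtering events.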

In the embodiment shown in FIG. 2, potential target sequences excluding restriction sites are further evaluated to determine if they meet certain criteria for nominal probe length in event 120. Potential target fragments are evaluated for appropriate length; for example, fragments shorter than 60 base pairs are not considered initially, and fragments longer than 800 base pairs are also put aside. The length cutoff parameters may be adjusted depending on many factors, for example, number of potential target fragments, characteristics of the genomic range of interest, dependence of hybridization rate on target length, processivity of labeling enzymes, and visual inspection of longer target sequences for repetitions.
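The length criterion of event 120 may be sketched as a simple partition. The representation of fragments as (start, end) half-open intervals and the function name are hypothetical; the 60 and 800 base pair cutoffs are the example values given above and are exposed as parameters because they may be adjusted.

```python
def partition_by_length(fragments, min_len=60, max_len=800):
    """Split fragments into those within the length window and those
    set aside (too short or too long) for possible later review."""
    kept, set_aside = [], []
    for start, end in fragments:
        length = end - start
        if min_len <= length <= max_len:
            kept.append((start, end))
        else:
            set_aside.append((start, end))
    return kept, set_aside

# Fragment lengths: 59, 60, 800 and 801 bases.
fragments = [(0, 59), (100, 160), (200, 1000), (2000, 2801)]
kept, set_aside = partition_by_length(fragments)
```

Retaining the set-aside list, rather than deleting those fragments, allows them to be re-evaluated if the cutoffs are changed.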

Those target fragments which do not meet the length criteria are discarded or put aside in event 130. It should be noted that these target fragments may be revisited at a later time in the method of selecting CGH probes if it is determined that the cutoff or threshold for target fragment length needs to be adjusted.

Potential target fragments may further be filtered by excluding target fragments containing repetitive sequences in event 140. RepeatMasker, a software program, is another useful tool for excluding regions of the genomic sequence that are known repetitive sequences from becoming potential target fragments. RepeatMasker uses a database of known sequences and algorithms to determine repetitive sequences in order to mask them in any sequence. Those target fragments which do not meet the non-repeat mask criteria are discarded or set aside in event 150. Again, by filtering out and reducing the number of target fragments at an early event in methods for CGH probe selection, the computational time to evaluate specific probe parameters at later events is reduced significantly.
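One way the non-repeat mask criterion of event 140 might be applied is sketched below. The sketch assumes that the repeat-masked input marks repetitive bases either in lowercase (soft masking) or as N (hard masking), and that a fragment is set aside when its masked fraction exceeds a threshold. The 0.5 threshold and the function names are hypothetical and are not taken from this disclosure.

```python
def masked_fraction(seq: str) -> float:
    """Fraction of bases that are soft-masked (lowercase) or N."""
    if not seq:
        return 1.0  # treat an empty fragment as fully unusable
    masked = sum(1 for base in seq if base.islower() or base == "N")
    return masked / len(seq)

def filter_by_repeat_content(fragments, max_masked_fraction=0.5):
    """Split fragment sequences into kept and set-aside lists based on
    how much of each fragment is repeat-masked."""
    kept, set_aside = [], []
    for frag in fragments:
        if masked_fraction(frag) <= max_masked_fraction:
            kept.append(frag)
        else:
            set_aside.append(frag)
    return kept, set_aside

kept, set_aside = filter_by_repeat_content(["ACGTACGT", "ACGTacgtNN"])
```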

In event 160, the target fragments which have met the criteria for repeat masking and nominal probe length are subjected to probe tiling. Probe tiling comprises computationally producing candidate probe sequences from the sequences of the target fragments. An arbitrary probe length is determined or chosen and the target fragment sequences are divided into segments having the specified probe length. For example, if a probe length of 60 base pairs is chosen, the non-repeat-masked regions of target fragments are tiled in steps of 30 bases to produce candidate probes in event 30. The probe tiling procedure starts a new probe at the first non-repeat-masked base within a target fragment sequence; when a repeat-masked section is encountered, that section is skipped and the tiling process restarts at the next non-repeat-masked base. The probe length and/or tiling step size may be altered to allow for more relaxed or more stringent parameters for candidate probe generation.
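The tiling procedure described above can be sketched as follows. This is an illustration only: the function names are hypothetical, and it assumes that repeat-masked bases are marked as lowercase letters or as N, which is an assumption about the masking convention rather than something stated in this description.

```python
from typing import List, Tuple


def _is_unmasked(base: str) -> bool:
    # Assumption: masked bases are lowercase or 'N'; unmasked bases are uppercase A/C/G/T.
    return base in "ACGT"


def tile_probes(seq: str, probe_len: int = 60, step: int = 30) -> List[Tuple[int, str]]:
    """Return (start, probe) pairs tiled across each unmasked run of the fragment."""
    probes: List[Tuple[int, str]] = []
    i, n = 0, len(seq)
    while i < n:
        if not _is_unmasked(seq[i]):
            i += 1  # skip over a repeat-masked section
            continue
        j = i
        while j < n and _is_unmasked(seq[j]):
            j += 1  # j marks the end of the current unmasked run
        # Tiling restarts at the first unmasked base of each run.
        for start in range(i, j - probe_len + 1, step):
            probes.append((start, seq[start:start + probe_len]))
        i = j
    return probes


if __name__ == "__main__":
    fragment = "A" * 70 + "n" * 10 + "C" * 95
    print([start for start, _ in tile_probes(fragment)])
```

Unmasked runs shorter than the probe length yield no probes in this sketch; relaxing the probe length or step size, as the description notes, changes how many candidates each run produces.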

The use of certain probe properties as in silico indicators for “good” probes allows the reduction of the total number of viable candidate probes. Referring now to FIG. 3, a flow chart is shown depicting one embodiment of the methods for analyzing candidate probes by certain probe properties in silico in accordance with the invention. Candidate probes generated from target fragment tiling in event 30 are subjected to annotation for expression and association with genes at event 170. In this event, a number of databases are utilized to determine probe sequences that are contained within introns or exons of either known genes or predicted genes. This process involves accessing the message alignments to genomic sequences, and determining whether the probes are either partially or wholly contained within the confines of the exons of the messages. This process allows the operator to arbitrarily choose the densities for the homology searching of intronic probes (probes within introns), exonic probes (probes within exons), and intergenic probes (probes between genes). The candidate probes may be annotated independently for whether they are in expressed intron regions, within the bounds of a message, or within a “pre-mRNA”.

In event 180, candidate probes are analyzed for target specificity. Specificity is the measure of the incidences of target sequences complementary to, or nearly complementary to, a candidate probe. “Complementary” refers to a sequence that can form a duplex that is Watson-Crick base-paired (or analogously paired by virtue of all possible nucleotide analogs and generic bases) to another sequence within the genome. “Nearly complementary” means that the sequences are complementary enough to form moderately stable duplexes, though some small number of bases are not Watson-Crick base-paired, due to mismatches, insertions, or deletions of bases in one sequence relative to the other.

By comparing the sequences of candidate probes to the entire genomic sequence, the specificity of the probes to their respective targets in the hybridized solution can be calculated. This is possible for human, mouse, rat and other species where the genomes have been sequenced either completely or nearly so.

Tools and methodologies to determine specificity include, but are by no means limited to: BLAST® (Basic Local Alignment Search Tool), MegaBLAST, BLAT (BLAST-Like Alignment Tool), ProbeSpec, and RepeatMasker.

BLAST, MegaBLAST, and BLAT are tools widely used for genomic and expression sequence analysis. They are all used to find sequences within a sequence data set (the genomic sequence for the organism of interest) that are similar to the query sequence (in this case, the probe sequence) to within some minimal number of matching bases. Each genomic sequence found that is similar to the intended target sequence is scored as a “hit”, together with its position and match properties, such as match start and stop positions, numbers of insertions and deletions, etc. Each hit is recorded as a separate record to an output file. MegaBLAST is quite similar to BLAST, and uses similar underlying algorithms, optimized for aligning sequences that differ slightly as a result of sequencing or other similar “errors”. MegaBLAST can run many times faster than BLAST, depending on the size of the candidate probe sequences. BLAT differs from BLAST in that it is designed for transcript alignment, so multiple alignments of a single transcript to a finite region of the genome are returned in a single record.

BLAST and MegaBLAST have the disadvantage that any non-specific probe can generate a large number of “hits”, each hit resulting in a separate record in the output file produced. This means that where probes are highly nonspecific, the output file can potentially become quite large. Although it is possible to limit the number of records produced, it is sometimes quite useful to know how many hits were found. As a result, a separate tool with a fast file parser is then necessary to take the BLAST results file and process it to generate a histogram of hits, a specificity score, or a specificity classification for each candidate probe. The use of ProbeSpec avoids this problem.

ProbeSpec, a COM object, is well suited for designing and selecting probes for a CGH array since, rather than keeping information pertaining to every hit for each probe, this program retains only the histogram of the “distances” of each hit relative to the query (or probe) sequence, as well as the information for the nearest match and the first exact matching sequence. The term “distance” refers to the number of base differences between the probe and a close target sequence in the background (other genomic stretches in the sample mix). ProbeSpec provides sufficient candidate probe information for the probe design process, without additional extraneous data. ProbeSpec reports the number of exact hits for each candidate probe, and also the distributions of the closer set of sequence matches up to some arbitrarily predetermined distance.

It should be noted that the number of mismatched, inserted, or deleted bases that must be considered (searched for and counted) in the homology search depends on the length of the intended probe, the destabilizing effect of the mismatches on the duplex stability, and the stringency of the conditions of the hybridization reaction. Typically, for longer probes more mismatches must be considered. For example, a reasonable minimum number of mismatched bases to consider for 30-mer probes is 7, whereas for 60-mer probes 15 or more mismatches should be considered.

At event 190, the candidate probes are annotated thermodynamically in silico. Thermodynamic parameters associated with duplex stability, target structural stability, and probe hairpin stability are utilized to evaluate the candidate probes. Duplex stability is the stability of the duplex formed between the probe and its target. Target structural stability evaluates the stability of secondary structure within the target sequence. Probe hairpin stability is the stability of secondary structure within the probe sequence. The presence of stable secondary structure interferes with the ability of the target to hybridize with the probe. The thermodynamic parameters utilized to annotate candidate probes include, but are not limited to, melting temperature (T_(m)), entropy (ΔS), enthalpy (ΔH), and Gibbs free energy (ΔG) values. Candidate probes may also be analyzed for base content, i.e. GC-content, which gives a good estimation of T_(m) for longer candidate probes.
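As an illustration of using base content as a proxy for T_(m), the following sketch computes the GC fraction of a probe and feeds it into a commonly quoted empirical approximation for long oligonucleotides, T_(m) ≈ 81.5 + 16.6·log10[Na+] + 41·(GC fraction) − 675/N. That formula and its constants are not taken from this description; they are an assumption chosen for illustration, and the function names are hypothetical. A nearest-neighbor thermodynamic model would normally be preferred for the ΔH, ΔS and ΔG values mentioned above.

```python
import math


def gc_fraction(probe: str) -> float:
    """Fraction of bases in the probe that are G or C."""
    seq = probe.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def approx_tm(probe: str, na_molar: float = 1.0) -> float:
    """Rough duplex Tm in degrees Celsius from GC content and length.

    Assumed empirical formula (not from the source document):
    Tm = 81.5 + 16.6*log10([Na+]) + 41*GC_fraction - 675/N
    """
    n = len(probe)
    return 81.5 + 16.6 * math.log10(na_molar) + 41.0 * gc_fraction(probe) - 675.0 / n


if __name__ == "__main__":
    probe = "GC" * 15 + "AT" * 15  # hypothetical 60-mer with 50% GC
    print(round(gc_fraction(probe), 2), round(approx_tm(probe), 2))
```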

At event 200, candidate probes are evaluated for homology signal to background and scored accordingly. The target specificity of a candidate probe is related to the homology of the probe with respect to the rest of the genome, as described above. Homology search results are reported by ProbeSpec as a histogram of hits at various distances, rather than as a single record per hit, as is typically done by BLAST. The homology properties for a candidate probe can be evaluated or defined by a single parameter, the homology signal-to-background or HomS2B parameter, or, when the HomS2B parameter is expressed on a log scale (defined below), HomLogS2B. The results from ProbeSpec can be processed to generate a homology signal-to-background (HomLogS2B) score for each candidate probe. The HomS2B parameter utilizes a plurality of homology information on candidate probes to generate a theoretical signal-to-background score for each probe, where the signal is defined to be the signal obtained from a single specific perfect match target (i.e. 100% homology), and the background is the superposition of all the non-specifically binding targets from the complex mix of the whole genome, or from the amplified or reduced-complexity mix that may be present in the case of a complexity-reduced assay.

A sample or target mixture with reduced complexity is one in which certain regions of the genome are either selectively amplified or physically separated from other regions. The use of reduced complexity target mixtures diminishes the level of stringency required for the hybridization to be effective.

Calculating a HomS2B score utilizes a simple approximation of the homology histogram of distances to generate the equivalent signal from an ensemble of targets competing with the probe. This is estimated by the following formula:

${{HomS}\; 2\; B} \equiv \frac{S}{B} \approx \frac{N_{t}}{\left( {N_{0} - N_{t}} \right) + {\sum\limits_{d = 1}^{D}\; {P_{d}N_{d}}}}$

where N_(d) represents the histogram of homology hits at each distance d (so that N_(0) is the number of exact matches), and d is defined as the number of single-base differences between a target and the probe.

N_(t) is the number of copies of intended targets. Two duplicated regions in close proximity on the same chromosome (which occur frequently) are a viable example of a case where probes would intentionally be designed to be complementary to more than one target. P_(d) is the signal penalty associated with the distance d. Although the penalty is given here as dependent only on the mismatch number (or “distance”), it may also be associated with the position of the mismatches, the base composition and orientation of the mismatch, and the nature of the mismatch (base insertion, base substitution, or base deletion). This is an approximation of the homology histogram of distances because the precise penalty is a function of the sequences of both the target and the probe for all near matches under consideration. The approximation is based on the assumption that the average signal reduction across a large number of probe-target mismatches is a reasonable representation for any given mismatch of the same order. This approximation of the homology histogram of distances can be either theoretically or empirically determined. The type of mismatch and the distribution of sequence mismatches affect the exact penalty given. For example, single-point insertions and deletions of 5 bases are generally less destabilizing than 5 single-base mismatches that are evenly distributed across the duplex. For that matter, the distribution of mismatches across a duplex will likely cause substantial variations in the penalty as well. The approximation of the homology histogram of distances may further comprise an additional step, which assumes a constant penalty P for each base-distance. In this case, the approximation P_(d)≅P^(d) is used.

Typically, the numerator, N_(t), is unity, as in most cases there is a single copy per chromosome of each target being probed. The denominator includes a term representing the unintended target regions that are exactly complementary to the probe. The first N_(t) intended targets are represented in the numerator, whereas all other exact copies are in the background, since they are not the intended targets.

The HomS2B parameter can be conveniently represented as a log normalized to the penalty per base, as shown in the following formula:

${{HomLogS}\; 2\; B} \equiv \frac{\log \left( \frac{S}{B} \right)}{\log (P)}$

HomLogS2B expresses the signal to background score for a probe in units commensurate with distance from the nearest hit. That is, if there is a single perfect match intended target to the probe of interest, no unintended perfect match targets, and a single significant background target at a distance of d, HomLogS2B will equal −d.
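The two formulas above can be sketched in a few lines. The function and parameter names below are hypothetical, the histogram is represented as a mapping from distance d to hit count N_(d), and the constant-penalty approximation P_(d)≅P^(d) with 0 < P < 1 is assumed; the value P = 0.5 in the example is purely illustrative.

```python
import math
from typing import Dict


def hom_s2b(n_exact: int, near_hits: Dict[int, int], penalty: float,
            n_targets: int = 1) -> float:
    """HomS2B = N_t / ((N_0 - N_t) + sum_d P^d * N_d), with P_d approximated as P^d.

    n_exact is N_0 (all exact matches, intended targets included);
    near_hits maps each distance d >= 1 to the hit count N_d.
    """
    background = (n_exact - n_targets) + sum(
        (penalty ** d) * count for d, count in near_hits.items()
    )
    if background == 0:
        return math.inf  # no competing targets found within the searched distance
    return n_targets / background


def hom_log_s2b(n_exact: int, near_hits: Dict[int, int], penalty: float,
                n_targets: int = 1) -> float:
    """HomLogS2B = log(S/B) / log(P)."""
    return math.log(hom_s2b(n_exact, near_hits, penalty, n_targets)) / math.log(penalty)


if __name__ == "__main__":
    # One intended exact match and one background hit at distance 3.
    print(hom_log_s2b(n_exact=1, near_hits={3: 1}, penalty=0.5))
```

With one intended exact match, no unintended exact matches, and a single background hit at distance 3, the background is P^3, so S/B = P^(−3) and the log score is −3 up to floating-point rounding, in agreement with the statement that HomLogS2B equals −d in this case.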

Another approach to calculating the signal to background score is to calculate the duplex T_(m) between the probe and each and every potential cross-hybridization competitor during a homology search.

At event 210, the candidate probes are analyzed for kinetic properties. Because the arrays are generally hybridized to their targets for a time less than that required to reach thermodynamic equilibrium, the observed signals from hybridized targets depend on the rate at which they hybridize, as well as on the thermodynamic stability of the duplex. Some targets, because they are exceptionally long or contain secondary structure, hybridize significantly more slowly than others, and are therefore less desirable targets to probe. Kinetic properties of targets can also be measured and evaluated empirically, in event 60, as further discussed below.

In silico properties which are very effective for eliminating unacceptable probes are probe homology (HomLogS2B) and duplex T_(m), together with hairpin free energy values. Probes which meet the in silico parameter(s) are selected in event 40 and those probes which do not meet the in silico parameter(s) are discarded or de-selected in event 42.

At event 50, candidate probes selected at event 40 are subjected to a pairwise selection process using an algorithm(s) which evaluates candidate probes by both the region of the genomic range of interest that they target and a specified probe property or characteristic (e.g. T_(m)). Candidate probes which pass the pairwise selection process are selected for a microarray in event 230. Probes which do not meet the pairwise criteria are discarded in event 232. The pairwise probe selection algorithm(s) reduces the number of probes associated with a given region of the genomic range of interest while weighting probe selection towards a preferred parameter value. The method of applying pairwise probe selection to candidate probes is more fully described in FIG. 6 below.

Candidate probes may be further subjected to a biased pairwise selection process, which uses an algorithm(s) that evaluates candidate probes by pairwise selection and then biases the results by a different probe characteristic. For example, two properties are analyzed during pairwise selection, e.g. the region of the genomic range of interest that the candidate probes target and a specified probe property or characteristic such as homology score, and the results are then biased with another probe parameter, such as T_(m). The biased pairwise probe selection algorithm(s) reduces the number of probes associated with a given region of the genomic range of interest while biasing probe selection towards a preferred parameter value. The method of applying biased pairwise probe selection to candidate probes is more fully described in FIG. 7 below.

Referring now to FIG. 4, there is shown a flow chart of the process of determining probe properties of candidate probes empirically. In general, a plurality of indicators for probe performance can be empirically determined for candidate probes, using samples whose relative copy numbers of complementary genomic sequences are equal. Any sample, when hybridized with the same sample as a reference, can serve for this level of empirical validation, whether the relative copy numbers of genes within the sample are known or unknown. The preferred samples to use for validation, however, are normal diploid cells, for which the copy numbers of all targets are equal. These types of experiments are sometimes called “self-self experiments”.

At event 238, the selected candidate probes are placed on an “intermediate” or test array for experimental analysis. At event 240, the candidate probes are analyzed for raw signal intensity. Raw signal intensity is the background-subtracted signal, without normalization, reported for each feature in each channel scanned.

The signal strength of a candidate probe is an effective indicator of probe performance. While some experimentation is necessary to measure signal strength, the experimentation does not require an independently variable set of targets. A probe's signal, in the absence of an actual target number change, is thought to be related to target specificity a priori, since the more non-specific a probe is, the more unwanted targets will bind to that probe. For a set of probes selected to have the same melting temperature (or a narrow range of melting temperatures), where each specific target molecule is labeled with a single label molecule, such as with end-labeling methods, all probes should have approximately the same specific signal. For other labeling methods, the number of label molecules on the targets can vary, so that the specific signal of a probe for a normal diploid sample also varies. However, if probes are chosen for targets within narrow ranges of sequence length and similar base composition, a moderately narrow range of signal strengths for specific probes is usually observed.

Besides direct binding of non-specific targets, another source of non-specific binding which contributes to excessive signals is so-called “sandwich cross-hybridization”. Since the targets are, in general, longer than the 60-mer probes, the probe-target duplexes often have dangling ends of single-stranded target. If the probe itself is specific to its respective target in one region of the target, but the target is non-specific in one or more other regions, then the target will bind many other labeled targets, and attach those indirectly to the probe. An increase in signal is seen relative to the signals of specifically bound probe-target duplexes. Probes subject to sandwich cross-hybridization can be screened in silico, or selected against in empirical validation event 240.

At event 250, the reproducibility of signal intensity is determined for candidate probes. The reproducibility of signal intensity is obtained by analyzing a number of different samples, with differing cell types and separate amplification, digestion, and labeling reactions, on different days, and measuring the variation in signal intensity after hybridization. Some probes are more sensitive than others to small variations in sample preparation conditions, and those probes are eliminated in this event. The more varied the samples and experimental conditions, the more robust the performance of the validated probes.

At event 260, the candidate probes are analyzed for dye bias in two-color measurements. When identical samples are labeled with different fluorometric labels and hybridized together (a “self-self” experiment), the log ratios of signals in the two channels can differ from zero either due to random variation, or because targets containing one label amplify, label, or hybridize more efficiently than targets containing the other label. This systematic difference between the signals from identical targets containing different labels depends in complex ways on the probe sequence. Probes that are particularly susceptible to dye bias are identified by their reproducible deviations from zero log ratios in replicate self-self experiments. Such probes are eliminated in event 260.

At event 270, candidate probes are analyzed for susceptibility to non-specific binding. A probe's susceptibility to non-specific binding is indirectly determined by the signal strength of the probe. Non-specific binding may also be determined in direct validation experiments as described in other events.

At event 280, the candidate probes are analyzed for stability during hybridization wash conditions. A relatively stringent wash step is necessary to remove undesired targets that are only weakly homologous to the intended target (and which therefore are not rejected by the homology search score in event 220), but which are sufficiently abundant in the genome that appreciable numbers of hybrids are formed. The wash is, however, not so stringent that it dissociates duplexes with the desired complementary targets to a significant degree. Therefore, if the slide is rewashed and rescanned, the signals should not change significantly, since the non-specific targets have already washed off in the first wash, and the specific targets do not wash off. Some probes, however, either bind unusually tightly to non-specific targets, or bind less strongly to their intended targets, so that their signals continue to decrease when rewashed. These probes are eliminated in event 280.

At event 290, candidate probes are analyzed for “persistence”. Persistence is an alternative measure of non-specific binding and is defined as the ratio of intensity signals at long hybridization times to intensity signals at short times. A persistent probe is one which hybridizes steadily, following a bimolecular rate law. Non-persistent probes show a rapid increase of signal at very short times, followed by a slow increase according to the usual kinetics. Persistence is a parameter/value which is complementary to the wash stability test of event 280. Persistence is measured by hybridizing the same sample to replicate arrays for two or more different lengths of time, usually one fairly short time (e.g. one hour) and one more typical time (e.g. 24 hours). Probes which show an excessive degree of prompt binding are eliminated in event 290.

The metrics for nonspecific binding presume that nonspecific signal arises from a large number of weakly bound sequences, which can be selectively removed in stringent washes, and which bind more rapidly (due to their high concentration) but dissociate more quickly (due to lower binding constants) than the perfectly complementary targets of interest.

Important probe performance indicators, which are determined empirically, are reference (normal sample) signal intensity, dye bias, wash stability, and persistence. At event 60, candidate probes are selected or given high performance scores if they meet the criteria set by the empirically measurable probe performance indicators utilized in events 240, 250, 260, 270, 280 and 290. Candidate probes which do not meet these criteria are not included in the final CGH/location analysis probe selection at event 70.

Candidate probes which are rejected or discarded from the selection process of the methods are more appropriately considered as probe outliers. Candidate probe outliers are defined by population statistics, involving means, standard deviations, medians, interquartile ranges and the like, and are rejected based on one or a combination of the probe properties mentioned above. Outliers include, but are not limited to, probes which lie at the outer edges of a probe property distribution and which may therefore exhibit compromised performance.
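One way to define outliers from population statistics is an interquartile-range rule, sketched below. The description does not prescribe a particular rule; the 1.5 × IQR fence, the function name, and the example values are assumptions made for illustration.

```python
import statistics
from typing import List, Sequence


def iqr_outlier_indices(values: Sequence[float], k: float = 1.5) -> List[int]:
    """Return indices of values lying more than k * IQR outside the quartiles.

    The multiplier k = 1.5 is a conventional choice, assumed here for illustration.
    """
    q1, _median, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [i for i, v in enumerate(values) if v < low or v > high]


if __name__ == "__main__":
    # Hypothetical per-probe property values (e.g. raw signal intensity).
    signals = [10, 11, 12, 13, 14, 15, 16, 100]
    print(iqr_outlier_indices(signals))
```

The same function could be applied independently to each probe property, with a probe rejected when it is flagged for one property or for a chosen combination of properties.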

Referring now to FIG. 5, there is shown a flow chart of events useful in determining probe properties of candidate probes directly by experimentation. Direct measurement of probe performance is made by validation experiments using samples whose relative copy numbers of complementary genomic sequences are known a priori. At event 300, candidate probes which have desirable probe properties determined in silico and/or empirically are laid out on a prototype array. In certain embodiments, it may be useful to proceed with direct experimentation with limited or no prior in silico or empirical data. The candidate probes are placed on the array using array techniques known to those skilled in the art of microarrays. Array layout protocols include randomization, periodic grid tiling, text-ordered tiling, and serpentine tiling. By making prototype “intermediate” arrays with more probes for a given region of the genomic range of interest than would be placed on a final array design, those probes that behave best according to some set of metrics for probe performance for that particular region can be selected.

At event 310, the candidate probes on the array are hybridized to various target sets comprising known target sequences with known copy number. A target set comprises a quantity of target molecules within the mixture that is deterministically altered, or known to differ in a well-defined way from that of a “normal” target set sample. A plurality of arrays are utilized to test various probe properties for the plurality of target sets.

In general, all subsets of the target sequences are altered in copy number (or deleted altogether) without dramatically altering the composition of the rest of the genome. For example, two normal tissue samples, one from a male tissue or cell line and another from a female, can be analyzed. Both will have the same number of target sequences for each region of every chromosome (notwithstanding the usual polymorphic variations) except the X-chromosome and the Y-chromosome. The male sample has a single copy of the X and Y chromosomes, whereas the female will have two nominally identical copies of the X-chromosome. So the male sample will have ½ the number of copies of the X-chromosome as does the female sample, and the female sample will have no copies of the Y-chromosome targets. Probes for targets on the X-chromosome should display, after normalization, twice as much signal for the female sample as for the male sample. The fractional increase in the log base 2 of the observed signal for a probe, when the copy number of its intended target is doubled, is the “slope” of that probe. Ideally, probes should have a slope of 1.0. Probes with significantly and systematically different slopes are inferior performers, and are issued low “differential response” scores. The same approach can be used with cell lines of known chromosomal copy number variations, where they can be found. It is unlikely that cell lines with alterations spanning the whole human genome can be derived from naturally occurring variations (e.g. diseases). With such a set of samples, multiple measurements analogous to those of male to female signal ratios for each probe can be obtained.
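The “slope” of a probe can be computed from two normalized signals measured at known copy numbers, as sketched below. The function name and the example signal values are hypothetical; the calculation simply divides the observed log2 signal change by the log2 copy-number change, so a doubling of copy number that doubles the signal gives a slope of 1.0.

```python
import math


def probe_slope(signal_low: float, signal_high: float,
                copy_low: float = 1.0, copy_high: float = 2.0) -> float:
    """Observed log2 signal change divided by the log2 copy-number change."""
    return math.log2(signal_high / signal_low) / math.log2(copy_high / copy_low)


if __name__ == "__main__":
    # Hypothetical normalized signals for an X-chromosome probe:
    # male sample (one copy) versus female sample (two copies).
    print(probe_slope(100.0, 200.0))  # ideal response
    print(probe_slope(100.0, 150.0))  # compressed response
```

A probe whose signal rises from 100 to only 150 when the copy number doubles has a slope of about 0.58 and would receive a lower differential response score than the ideal probe.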

Also, when the known copy number of a particular target sequence in a sample is zero (as, for example, Y-chromosome probes in female samples), any signal observed for that probe must result from cross-hybridization. Probes that show significant signal for samples in which their known copy number is zero are scored low on the “cross-hybridization” score.

At event 320, the candidate probes are measured for probe performance for each target set. The results of the hybridization experiment are analyzed for a plurality of probe performance indicators which may include, but are not limited to, slope of response curve (a differential response score), cross-hybridization, Y-axis intercept of response curve (equivalent to dye bias), reproducibility or noise, P-value of separability of distributions based on repeated measurements at two or more target copy number values, variance of signals, and variance of ratios.

At event 340, candidate probes are scored and/or ranked according to the various indicators or parameters for probe performance. The candidate probes are scored for each target set tested.

At event 350, the experimental results obtained from each target set for each candidate probe are compared to validate candidate probes across target sets.

At event 360, the candidate probes are evaluated for adequate differential response across target sets. For example, probes may be chosen that give the slope closest to the theoretical slope for the set of samples. This is accomplished by simple filtering, such as by selecting a range of tolerable ratios, or by using a more complex algorithm that uses the ratio information in conjunction with other probe information.

At event 370, candidate probes are evaluated for signal reproducibility across target sets. Signal reproducibility is determined in the same manner as with the self-self validation experiments described in event 250.

At event 80, candidate probes which have been validated by the probe metrics determined experimentally are passed to the next step of the selection process or may be selected for placement on a CGH array. Candidate probes which are not validated are discarded at event 90.

Often during probe design, far more candidate probes are generated than the number of probes that are actually needed to cover a given region or a given gene. Generally, it is desired to have uniform spatial coverage over the chromosome, or over some region of interest. However, there are desirable parameters other than spatial coverage, which are used to bias probe selection. For example, where there are more candidate probes than resources to search with, probes may be sacrificed according to other parameters. With the knowledge that probes with lower T_(m)'s often behave better than high-T_(m) probes, candidate probes in a certain region of interest may be analyzed or biased by duplex T_(m) probe values. Pairwise probe selection allows for the filtering of candidate probes within a region of interest by a specific probe property.

Pairwise filtering is utilized in some embodiments to filter candidate probes to generate probe sets for intermediate arrays. The pairwise filtering is used on candidate probes within genes (on a gene-by-gene basis) to reduce the number of probes per gene to a small, reasonably uniformly spaced set, while simultaneously selecting higher scoring probes based on in silico parameters, and perhaps also biasing these results (biased pairwise filtering) for whether the probe is in an exon rather than an intron.

Pairwise filtering is also useful in identifying probes within intergenic regions to provide a somewhat uniform coverage between genes. The target density across intergenic regions may be set prior to pairwise filtering by the probe designer. Biased pairwise analysis may also be utilized within intergenic regions, biasing towards lesser-quality genes, mRNAs, transcripts, pseudogenes, ESTs, or exonic regions. In some embodiments, all of the intergenic regions of a chromosome may be pairwise filtered together, unlike the genes of a chromosome, which are generally pairwise filtered separately.

Referring now to FIG. 6, there is shown a flow chart of the events for one embodiment of the pairwise process for analyzing candidate probes for CGH arrays in accordance with the invention. At event 380, a probe set (e.g. a set of candidate probes within a gene or chromosome of interest) and a probe property are selected for pairwise analysis. An exemplary probe property is the duplex melting temperature of the candidate probes, designated as T_(i) for each probe. Along with the probe property, an optimal parameter value T_(o) (e.g. the average value of that property among all the candidate probes) is determined. At event 382, a single combined score value is generated that integrates the probe properties of interest, weighted by their importance or utility in predicting good probe performance, and all probes are marked as viable at event 384.

At event 390, the genomic distances d_(ij) between neighboring viable candidate probes within the region of interest (e.g. on a specified chromosome, or gene of interest) are determined. “Genomic distance” means the number of nucleotide bases separating the two probe positions on the chromosome sequence of interest. The criteria for determining distances include, but are not limited to, the distances between pairs of neighboring probes or the average distance of each probe from its two neighbors.

At event 400, once the genomic distances between neighboring viable probes have been determined, the probes N with genomic distances less than a distance D are identified. The candidate probes are analyzed repeatedly for distance measurements until there are no remaining closely spaced probes, i.e. probes with d_(ij)<D. Two neighboring probes spaced less than a distance D apart are given preferential consideration over probe neighbors not meeting this criterion. Candidate probes are sorted from smallest distance between neighbors to largest genomic distance in the embodiment shown in FIG. 6, at event 400.

At event 410, candidate probes of the probe set are analyzed for duplex T_(m) properties. The duplex T_(m) is determined for each probe within a pair using established predictive formulas. In certain embodiments, the pair of probes may be analyzed for a plurality of properties other than T_(m), or in combination with T_(m) determination. In FIG. 6, the probes having a duplex T_(m) value further from T_(o) than that of their neighboring probe are flagged for elimination from the candidate probe set at event 420.

The process of analyzing a probe pair in events 410 and 420 is repeated a predetermined number of times as a matter of efficiency at event 430.

The duplex T_(m) analysis is continued on the next probe pair at event 432. The next probe pair may be the next probe pair in order on the chromosome (region of interest), the next pair with the most closely spaced probes (e.g. the pair comprising the smallest gap between probes), or the next two probes with the largest gap size. In the embodiment of FIG. 6, the next neighboring pair to be analyzed for T_(m) values is the next pair with the largest distance.

At event 440, all of the probes flagged for elimination from the viable pool of probes in the region of interest are removed from the probe set. After one round of analysis based on the chosen probe property, i.e. T_(m) in this example, events 390, 400, 410, 420, 430 and 440 are repeated in event 450 until all probes have met the minimal distance criteria, or until the desired number of probes is achieved. In the subsequent rounds of pairwise analysis the probe neighbors change due to the elimination of some probes not meeting the distance criteria or the accepted values for the probe indicator selected, i.e. T_(m). Exemplary probe indicators useful in pairwise analysis may include all of the probe selection criteria described above. In event 460, the remaining viable probes with appropriate distance parameters and the best values for the probe property or properties tested are selected.
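The pairwise process of FIG. 6 may be illustrated with a minimal Python sketch. This is only one possible reading of events 390 through 460 and is not taken from the specification: the names Probe, pairwise_filter, min_distance and target_tm are hypothetical, a single probe property (duplex T_(m)) stands in for the combined score, and the optional stop at a desired number of probes is omitted.

```python
from dataclasses import dataclass


@dataclass
class Probe:
    position: int  # genomic coordinate of the probe (hypothetical field)
    tm: float      # predicted duplex Tm in degrees Celsius


def pairwise_filter(probes, min_distance, target_tm):
    """Thin probes until no two neighbors are closer than min_distance."""
    viable = sorted(probes, key=lambda p: p.position)
    while len(viable) > 1:
        # Event 390: distance from each viable probe to its right-hand neighbor.
        gaps = [(viable[i + 1].position - viable[i].position, i)
                for i in range(len(viable) - 1)]
        # Event 400: keep only closely spaced pairs, smallest gap first.
        close = sorted(g for g in gaps if g[0] < min_distance)
        if not close:
            break
        flagged = set()
        for _, i in close:
            if i in flagged or i + 1 in flagged:
                continue  # this pair already lost a member in this round
            a, b = viable[i], viable[i + 1]
            # Events 410/420: flag the probe whose Tm is further from the target.
            worse = i if abs(a.tm - target_tm) > abs(b.tm - target_tm) else i + 1
            flagged.add(worse)
        # Event 440: remove flagged probes; neighbors change for the next round.
        viable = [p for k, p in enumerate(viable) if k not in flagged]
    return viable
```

Because neighbors are recomputed after every round of removals, a probe that survives is compared against a new neighbor in each later round, which mirrors the repetition at event 450.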

An example of the value of the pairwise analysis is shown in FIGS. 7 and 8. FIG. 7 is a graph of the T_(m) distributions of all candidate probes for chromosome 16 prior to pairwise analysis. It should be noted that, due to the widely varying number of probes in the distributions, the vertical scales are given in terms of “Relative Probe Density,” where the sum of all bars in each distribution is normalized to 1 in order to make the shapes of the distributions more apparent. The distribution for 60-mer candidate probes within intronic and intergenic regions is shown in white (808,285 probes) while the distribution of the candidate 60-mer probes within exon regions (about 4% of all) is shown in black (35,694 probes). The distributions of T_(m)s for all probes are fairly broad (about 20 degrees), and the average T_(m)s differ between intronic and exonic probes by more than 15 degrees Celsius.

After applying a pairwise filtering which preferentially selected for probes with duplex T_(m) values within a predetermined delta T_(m) range, the probe set shown in FIG. 7 has been reduced in size considerably. The number of “surviving” candidate probes is reduced 170-fold, from about 840,000 to about 5,000, and the new distribution is shown in FIG. 8. This means that, on average, each probe remaining in the selected set was considered for elimination on at least seven occasions, and was kept each and every time. FIG. 8 is a graph showing the T_(m) distributions of filtered probes for chromosome 16, where the filtering uses the pairwise probe selection algorithm to reduce the number of probes while dramatically reducing the width of the T_(m) distributions. The filtered distribution for probes within intronic and intergenic regions is shown with black bars (4,000 probes) while the distribution of the candidate probes within exon regions is indicated by white bars (976 probes). The filtered T_(m) distributions shown in FIG. 8, with a full-width at half-maximum (FWHM) of less than 1 degree Celsius, are much narrower than their respective candidate probe distributions of FIG. 7. This pairwise analysis also demonstrates an enrichment of expressed sequences from about 4 percent of candidate probes to about 20 percent of filtered probes. It should be noted that while the pairwise algorithm is being used here to select probes for CGH arrays, it may also be used for selecting probes of various array systems including but not limited to gene expression arrays. Some embodiments of the methods of the invention make use of an additional probe property to bias pairwise probe selection, for example in addition to the genomic distance and a probe property such as duplex T_(m).
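The figures quoted for chromosome 16 can be checked with a few lines of Python. The probe counts are those given for FIGS. 7 and 8; the variable names are illustrative only, and reading the "seven occasions" as the number of successive halvings (the base-2 logarithm of the fold reduction) is an assumption about how that statement was derived.

```python
import math

# Probe counts quoted for FIG. 7 (candidates) and FIG. 8 (filtered).
candidates_intronic = 808_285
candidates_exonic = 35_694
filtered_intronic = 4_000
filtered_exonic = 976

total_candidates = candidates_intronic + candidates_exonic
total_filtered = filtered_intronic + filtered_exonic

# Overall reduction and the number of halvings needed to reach it.
fold_reduction = total_candidates / total_filtered
halvings = math.log2(fold_reduction)

# Fraction of probes lying in exons before and after filtering.
exon_before = candidates_exonic / total_candidates
exon_after = filtered_exonic / total_filtered

print(f"fold reduction: {fold_reduction:.1f}")
print(f"halvings (log2): {halvings:.2f}")
print(f"exonic fraction: {exon_before:.1%} -> {exon_after:.1%}")
```

If each pairwise comparison removes one probe of two, a reduction of roughly 170-fold corresponds to a little more than seven rounds of survival per retained probe, which is consistent with the statement above.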

An example of the results of a pairwise analysis biased towards probes within an exon is shown in FIG. 9, as discussed above with regard to an example on chromosome 16. FIG. 9 shows a plot of the duplex melting temperatures of two sets of probes, in a narrow zoomed-in region (from 31 Mbp to 31.5 Mbp), plotted as a function of position along chromosome 16. The candidate probes are indicated as black dots, and the other set, comprising 5000 probes selected by the biased pairwise algorithm, is indicated by the square and circular markers. The square markers in FIG. 9 correspond to selected expressed probes (probes for targets in exons), whereas the circular markers indicate probes selected in intergenic or intronic regions. Also shown along the bottom axis are points indicating the positions of probes that are within exon regions. The horizontal line at 80 degrees indicates the target temperature. Despite the reduction in mean temperatures from about 90 degrees (for expressed probes) to about 80 degrees, it can be seen that a substantial fraction of probes are selected within exons in this example and that many of the probes have temperatures quite near the target of 80 degrees. Additionally, it can be seen that the selected probes are reasonably uniformly spaced.

It should also be noted that while the pairwise and biased pairwise filtering processes are being used to select probes for CGH arrays in the embodiment described above, they are also useful for selecting probes of various array systems including but not limited to location analysis, expression arrays, and the like.

The probes utilized for the microarrays of the instant invention were selected by the methods described above. The microarrays of the invention comprise a solid support in which a plurality of the selected polynucleotide probes are bound to the surface, or attached to the solid support of the array. Techniques known to those skilled in the art of microarray manufacturing are utilized in attaching the probes to the solid support. The plurality of polynucleotide probes attached to the support have a corresponding plurality of different nucleotide sequences. FIG. 10 is a block diagram of a microarray comprising a solid support 480 and a plurality of nucleotide probes, 490A, 490B and 490C, attached to the solid support. 490A, 490B and 490C are only representative of the probes on the array, and the dots, 492, on FIG. 10 indicate that at least 1,000 probes may be placed on the solid support. The number of probes placed on the array may range from about 1,000 to about 50,000, more particularly from about 10,000 to about 40,000, depending on the intended use of the microarray. In certain embodiments the number of probes on the array may be about 1,000, 2,000, 5,000, 10,000, 20,000, 30,000 or about 40,000. The nucleotide probes on the microarrays of the invention were selected for specific probe properties determined by the methods of the invention.

The probes of the microarray comprise similar thermodynamic properties identified by the methods of the invention. A large percentage of the probes bound to the solid support comprise a duplex Tm value which falls within a very narrow Tm distribution or delta Tm of about 0.25° C. to about 5° C., usually about 0.25° C. to about 3° C., more usually about 0.25° C. to about 2° C. Delta Tm is defined as a temperature distribution in which the median Tm is approximately in the center of the distribution. Probes which are within the delta Tm have a duplex Tm greater than the median Tm−(delta Tm)/2 but less than the median Tm+(delta Tm)/2. Most of the melting temperatures spanned by the delta Tm usually fall within the temperature range of about 65° C. to about 90° C. when calculated by the method described in Breslauer et al., Proc Natl Acad Sci. (PNAS) 1986 June; 83(11): 3746-3750, where the target and probe concentrations are both 0.1 pM and the salt concentration term is set equal to zero. Thus, consideration must be given to the experimental conditions under which duplex Tm values have been calculated.
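The delta Tm definition above can be expressed as a short Python sketch. The function name fraction_within_delta and its arguments are hypothetical, and the sketch assumes the duplex Tm values have already been computed by whatever predictive formula is in use.

```python
from statistics import median


def fraction_within_delta(tms, delta_tm):
    """Fraction of probes with median - delta/2 < Tm < median + delta/2."""
    med = median(tms)
    lower = med - delta_tm / 2
    upper = med + delta_tm / 2
    inside = sum(1 for t in tms if lower < t < upper)
    return inside / len(tms)


# Illustrative values only, not data from the specification.
example_tms = [78.0, 79.5, 80.0, 80.4, 83.0]
print(fraction_within_delta(example_tms, delta_tm=2.0))
```

The strict inequalities follow the wording "greater than" and "less than" in the definition; a probe lying exactly on a window edge is counted as outside.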

A microarray having probes which have very similar Tm values has been shown experimentally to be very effective in CGH and location analysis.

The percent of probes which have a duplex Tm value within a delta Tm of about 5° C. is from about 60% to about 99%, usually about 80% to about 99%, and more usually about 95% to about 99%. The percent of probes which have a duplex Tm value within a delta Tm of about 4° C. is from about 50% to about 99%, usually about 80% to about 99%, and more usually about 90% to about 99%. The percent of probes which have a duplex Tm value within a delta Tm of about 3° C. is from about 50% to about 99%, usually about 80% to about 99%, and more usually about 90% to about 99%. The percent of probes which have a duplex Tm value within a delta Tm of about 2° C. is from about 50% to about 99%, usually about 80% to about 99%, and more usually about 90% to about 95%. The percent of probes which have a duplex Tm value within a delta Tm of about 1.5° C. is from about 50% to about 99%, usually about 70% to about 95%, and more usually about 80% to about 90%. The percent of probes which have a duplex Tm value within a delta Tm of about 1.0° C. is from about 50% to about 99%, usually about 70% to about 95%, and more usually about 70% to about 85%. The percent of probes which have a duplex Tm value within a delta Tm of about 0.5° C. is from about 50% to about 99%, usually about 50% to about 90%, and more usually about 50% to about 80%.

The majority of probes attached to the solid support have a duplex Tm value ranging from about 65° C. to about 85° C., usually from about 75° C. to about 85° C., more usually from about 78° C. to about 82° C. Tm values for a particular probe may vary due to the salt concentration in the probe solution, target concentration, probe concentration, as well as other factors. In one embodiment, the percent of probes on the solid support which have a duplex Tm value between 65° C. and about 85° C. is about 90% to about 100%. In another embodiment, the percent of probes on the solid support which have a duplex Tm value between 75° C. and about 85° C. is about 90% to about 99%. In yet another embodiment, the percent of probes on the solid support which have a duplex Tm value between 77° C. and about 82° C. is about 85% to about 99%. Other embodiments may have a percent of the probes having a duplex Tm value between 79° C. and about 81° C. of from about 85% to about 98%.

Table 1 gives the Tm values and Tm distributions for probes for one embodiment of a microarray of the invention. In this embodiment, the microarray, HGA1.1, has approximately 40,000 unique probes which have very similar Tm values. Table 1 shows that the majority of the probes (92%) of the HGA1.1 array have a Tm value between 79° C. and 81° C. During the probe selection process for the HGA1.1 array, the Tm distribution for the probes was set narrowly around a median Tm of 80° C., which filtered out a large number of candidate probes which did not fall within the specified Tm range. Table 1 also demonstrates the use of a duplex Tm cutoff temperature: in this embodiment, candidate probes with a Tm greater than 81° C. were not selected.

Referring now to FIG. 11, there is shown a graph comparing the T_(m) distributions of probes of a microarray in accordance with the invention to the candidate probes utilized to make the microarray, as well as to probes from a typical expression array. The HGA1.1 array, as described above, is one embodiment of the invention and is represented as a solid line in FIG. 11, while HOCDD, an expression array, is shown as a dotted line, and the candidate probes, prior to T_(m) selection, are shown as a dashed line. Both arrays comprise approximately 30,000 to 40,000 probes of 60 bases in length, while approximately 30 million non-repetitive candidate probes prior to filtering are also shown in FIG. 11. FIG. 11 shows that the selected probes utilized in fabricating the microarrays of the invention, in general, have a very narrow T_(m) range in comparison to the expression array or the initial candidate probes. The T_(m) median for the exemplary microarray, HGA1.1, shown in FIG. 11 is about 80° C., which corresponds to 36% GC content using the T_(m) calculation utilized by the inventors. The fraction of probes within a differential or delta T_(m) range for the two arrays, as well as for all the candidate probes, is shown in the graph of FIG. 12. FIG. 12 emphasizes the large number of probes utilized for the microarrays of the invention with similar T_(m) values. Over 90% of the probes of HGA1.1 are within a 2 degree T_(m) differential, while less than 20% of the probes of the HOCDD expression array are within a 2 degree T_(m) differential. Both FIGS. 11 and 12 demonstrate one of the novel features of certain embodiments of the microarrays of the invention.

TABLE 1
Tm values of probes on HGA1.1 microarray.

Tm Range (° C.)    Tm Distribution of probes (number of probes)
65                 0
65.5               0
66                 0
66.5               0
67                 0
67.5               0
68                 0
68.5               0
69                 1
69.5               3
70                 2
70.5               5
71                 8
71.5               13
72                 20
72.5               24
73                 29
73.5               51
74                 53
74.5               63
75                 97
75.5               114
76                 153
76.5               212
77                 262
77.5               360
78                 586
78.5               904
79                 1668
79.5               10658
80                 12769
80.5               7385
81                 4535
81.5               0
82                 0
82.5               0
83                 0
83.5               0
84                 0
84.5               0
85                 0
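The summary statistics quoted for Table 1 can be reproduced with the following Python sketch. The counts are transcribed from Table 1 (bins with zero probes are omitted); treating each listed Tm value as a bin label, and counting the bins labelled 79 through 81 inclusive as the "79° C. to 81° C." range, are assumptions made here for illustration.

```python
# Non-zero rows of Table 1: Tm bin label (degrees C) -> number of probes.
TABLE_1 = {
    69.0: 1, 69.5: 3, 70.0: 2, 70.5: 5, 71.0: 8, 71.5: 13, 72.0: 20,
    72.5: 24, 73.0: 29, 73.5: 51, 74.0: 53, 74.5: 63, 75.0: 97,
    75.5: 114, 76.0: 153, 76.5: 212, 77.0: 262, 77.5: 360, 78.0: 586,
    78.5: 904, 79.0: 1668, 79.5: 10658, 80.0: 12769, 80.5: 7385,
    81.0: 4535,
}

total_probes = sum(TABLE_1.values())

# Probes in the bins labelled 79 through 81, inclusive.
in_window = sum(n for tm, n in TABLE_1.items() if 79.0 <= tm <= 81.0)
fraction_in_window = in_window / total_probes

# No probe appears above the 81 degree cutoff.
above_cutoff = sum(n for tm, n in TABLE_1.items() if tm > 81.0)

print(total_probes, in_window, f"{fraction_in_window:.1%}", above_cutoff)
```

Under these assumptions the total is close to the stated 40,000 probes and the fraction in the 79 to 81 bins is about 92%, in agreement with the description of the HGA1.1 array.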

In certain embodiments, probes attached to a support have a nucleotide length ranging from about 20 nucleotides to about 100 nucleotides, usually about 40 nucleotides to 70 nucleotides, and more usually about 50 to 65 nucleotides in length. In some embodiments all the probes on the array have the same length, for example a length of about 60 nucleotides. In other embodiments about 40% to about 60% of all the probes have a length of 60 nucleotides. During the probe selection process described in the methods of the invention, some of the probes are trimmed or shortened to change the T_(m) of the probe so that it falls within a predetermined delta T_(m) range. Thus some of the probes on the array may be shorter than others.

In certain embodiments, the probes of the microarray comprise similar GC content identified by the methods of the invention. A large percentage of the probes bound to the solid support comprise a % GC content which falls within a very narrow % GC distribution or delta % GC of less than about 10%, usually about 5%, and more usually 3%. Delta % GC is defined as a distribution of % GC content of probes in which the median % GC content is approximately in the center of the distribution. Probes which are within the delta % GC have a GC content greater than the median % GC−(delta % GC)/2 but less than the median % GC+(delta % GC)/2. The delta % GC usually falls within the GC content range of about 30% GC content to about 50% GC content for a given probe. A microarray having probes which have very similar % GC content has been shown experimentally to be very effective in CGH and location analysis.

The percent of probes which have a % GC content within a delta % GC of less than 10 is from about 60% to about 99%, usually about 70% to about 90%, and more usually about 80% to about 90%. The percent of probes which have a % GC content within a delta % GC of less than 5 is from about 40% to about 90%, usually about 60% to about 90%, and more usually about 70% to about 90%. The percent of probes which have a % GC content within a delta % GC of less than 3 is from about 35% to about 80%, usually about 40% to about 70%, and more usually about 40% to about 65%.

In one embodiment, about 60% to about 99% of the polynucleotide probes attached to the solid support have a % GC content in the range of 30% to 40%. In another embodiment, about 60% to about 95% of the polynucleotide probes attached to the solid support have a % GC content in the range of 34% to 40%. In yet another embodiment, about 70% to about 90% of the polynucleotide probes attached to the solid support have a % GC content in the range of 34% to 40%.

Referring now to FIG. 13, there is shown a graph comparing the GC content distributions of probes of a microarray in accordance with the invention to the candidate probes utilized to make the microarray, as well as to probes from a typical expression array. The HGA1.1 array, as described above, is one embodiment of the invention and is represented as a solid line in FIG. 13, while HOCDD, an expression array, is shown as a dotted line, and the candidate probes, prior to GC content selection, are shown as a dashed line. Both arrays comprise approximately 30,000 to 40,000 probes of 60 bases in length, while approximately 30 million non-repetitive candidate probes prior to filtering are also shown in FIG. 13. FIG. 13 shows that the selected probes utilized in fabricating the microarrays of the invention, in general, have a very narrow delta or differential % GC content range in comparison to the expression array or the initial candidate probes. The fraction of probes within a differential or delta % GC content range for the two arrays, as well as for all the candidate probes, is shown in the graph of FIG. 14.

FIG. 14 emphasizes the large number of probes utilized for the microarrays of the invention with similar % GC content values. Approximately 80% of the probes of HGA1.1 are within a 5% GC content differential, while less than 40% of the probes of the HOCDD expression array are within a 5% GC content differential.

In some embodiments the probes on the array target various regions of the human genome such as exons, introns, regulatory regions and intergenic regions. In other embodiments the probes target a genome other than human, such as mouse, rat, etc. In one embodiment, at least 30% of the probes target exons. In another embodiment, at least 5% of the probes of the microarray target regulatory regions. In yet another embodiment, at least 50% of the probes on the support target intergenic regions.

While the microarrays of the invention are useful for CGH and location analysis, they are not limited to these types of analysis. For example, the microarrays of the invention may be utilized for expression analysis as well.

FIG. 15 illustrates a typical computer system 500 that may be used in processing events described herein. The computer system 500 includes any number of processors 502 (also referred to as central processing units, or CPUs) that are coupled to storage devices including primary storage 506 (typically a random access memory, or RAM) and primary storage 504 (typically a read only memory, or ROM). As is well known in the art, primary storage 504 acts to transfer data and instructions uni-directionally to the CPU, and primary storage 506 is used typically to transfer data and instructions in a bi-directional manner. Both of these primary storage devices may include any suitable computer-readable media such as those described above. A mass storage device 508 is also coupled bi-directionally to CPU 502 and provides additional data storage capacity and may include any of the computer-readable media described above. Mass storage device 508 may be used to store programs, data and the like and is typically a secondary storage medium such as a hard disk that is slower than primary storage. It will be appreciated that the information retained within the mass storage device 508 may, in appropriate cases, be incorporated in standard fashion as part of primary storage 506 as virtual memory. A specific mass storage device such as a CD-ROM 514 may also pass data uni-directionally to the CPU.

CPU 502 is also coupled to an interface 510 that includes one or more input/output devices such as video monitors, track balls, mice, keyboards, microphones, touch-sensitive displays, transducer card readers, magnetic or paper tape readers, tablets, styluses, voice or handwriting recognizers, or other well-known input devices such as, of course, other computers. The CPU 502 optionally may be coupled to a computer or telecommunications network using a network connection as shown generally at 512. With such a network connection, it is contemplated that the CPU might receive information from the network, or might output information to the network, in the course of performing the above-described method steps. Finally, a database 516 is bi-directionally coupled to network connection 512 for storage and retrieval of information pertaining to the probe design process and microarray fabrication. The above-described devices and materials will be familiar to those of skill in the computer hardware and software arts.

The hardware elements described above may implement the instructions of multiple software modules for performing the operations of this invention. For example, instructions for population of stencils may be stored on mass storage device 508 or 514 and executed on CPU 502 in conjunction with primary memory 506.

In addition, embodiments of the present invention further relate to computer readable media or computer program products that include program instructions and/or data (including data structures) for performing various computer-implemented operations. The media and program instructions may be those specially designed and constructed for the purposes of the present invention, or they may be of the kind well known and available to those having skill in the computer software arts. Examples of computer-readable media include, but are not limited to, magnetic media such as hard disks, floppy disks, and magnetic tape; optical media such as CD-ROM, CDRW, DVD-ROM, or DVD-RW disks; magneto-optical media such as floptical disks; and hardware devices that are specially configured to store and perform program instructions, such as read-only memory devices (ROM) and random access memory (RAM). Examples of program instructions include both machine code, such as produced by a compiler, and files containing higher level code that may be executed by the computer using an interpreter.

While the present invention has been described with reference to the specific embodiments thereof, it should be understood by those skilled in the art that various changes may be made and equivalents may be substituted without departing from the true spirit and scope of the invention. In addition, many modifications may be made to adapt a particular situation, material, composition of matter, process, process step or steps, to the objective, spirit and scope of the present invention. All such modifications are intended to be within the scope of the claims appended hereto.

Examples

The following examples are put forth so as to provide those of ordinary skill in the art with a complete disclosure and description of how to make and use the present invention, and are not intended to limit the scope of what the inventors regard as their invention, nor are they intended to represent that the experiments below are all or the only experiments performed. Efforts have been made to ensure accuracy with respect to numbers used (e.g. amounts, temperature, etc.) but some experimental errors and deviations should be accounted for. Unless indicated otherwise, parts are parts by weight, molecular weight is weight average molecular weight, temperature is in degrees Celsius, and pressure is at or near atmospheric.

Example I

This example utilized a single array design, with 22000 features, and included candidate probes for five chromosomes: Chr16, Chr17, Chr18, ChrX and ChrY. The test array had approximately three times as many probes, for each chromosome, as were eventually needed. The purpose of the test experiments was to select the best probe, according to the criteria described above, from these 3-fold redundant probe candidates.

The probes were designed and selected for five chromosomes, Chr16, Chr17, Chr18, ChrX and ChrY. Each chromosome was analyzed separately as a genomic range of interest, and probes for that chromosome which met the in-silico probe criteria (event 40 of FIG. 1) were placed onto a semi-final array. Below are the events which were carried out to identify candidate probes to be placed on a semi-final array.

Restriction digest of genomic sequences: Each chromosome was subjected to a computational process that models the restriction digest of the hybridization assay by cutting the chromosomal DNA sequences in silico at the sites which would be cut in the experimental assay. In this example we used the restriction enzymes RsaI, which cuts at the “|” in GT|AC, and AluI, which cuts at AG|CT. Each cut site is identified in the sequence, allowing these sites to be omitted from potential probe sequences, thus decreasing the computational time needed to analyze the probes.
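An in silico digest of this kind may be sketched in Python as follows. The function name digest and the table name ENZYMES are hypothetical; the recognition sites and cut positions are those stated above for RsaI (GT|AC) and AluI (AG|CT). Both sites are palindromic, so scanning a single strand is sufficient.

```python
import re

# Recognition site -> offset of the cut within the site.
ENZYMES = {
    "GTAC": 2,  # RsaI cuts GT|AC
    "AGCT": 2,  # AluI cuts AG|CT
}


def digest(sequence):
    """Return the fragments produced by cutting at every RsaI and AluI site."""
    seq = sequence.upper()
    cuts = set()
    for site, offset in ENZYMES.items():
        # A lookahead pattern finds every occurrence, including overlapping ones.
        for match in re.finditer(f"(?={site})", seq):
            cuts.add(match.start() + offset)
    bounds = [0] + sorted(cuts) + [len(seq)]
    return [seq[a:b] for a, b in zip(bounds, bounds[1:]) if b > a]


print(digest("AAGTACCCAGCTTT"))
```

The fragments returned here correspond to the potential target sequences that are passed on to the fragment filtering described next.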

Target Fragment filtering: Here potential target sequences were filtered by size, number of repeat-masked bases and/or GC-content. Target fragments shorter than a nominal probe length, i.e. <60 bp, were eliminated. Fragments with a length greater than 800 bp were also removed from the target fragment set. The latter cutoff was determined by a visual inspection of the longer target sequences for repetitions. Many, if not most, sequences >800 bp have a relatively low sequence complexity that is often obvious on visual inspection. Target fragments that were largely repeat masked were also removed from the target fragment set.
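The size and repeat-mask filters may be illustrated by the following Python sketch. The 60 bp and 800 bp limits come from the example above; the function name keep_fragment, the convention that repeat-masked bases appear as lower-case letters or N, and the 50% masking threshold are assumptions, since the example states only that "largely" repeat-masked fragments were removed.

```python
def keep_fragment(fragment, min_len=60, max_len=800, max_masked_fraction=0.5):
    """Return True if a target fragment passes the size and repeat-mask filters."""
    length = len(fragment)
    if length < min_len or length > max_len:
        return False
    # Assumed convention: soft-masked repeats are lower case, hard-masked are N.
    masked = sum(1 for base in fragment if base.islower() or base == "N")
    return masked / length <= max_masked_fraction


fragments = ["A" * 59, "ACGT" * 20, "acgt" * 15 + "ACGT" * 5, "ACGT" * 201]
print([keep_fragment(f) for f in fragments])
```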

Candidate Probe Tiling: Possible probes were then tiled across non-repeat-masked regions. 60-mer candidate probes were tiled in steps of 30 bases. Although these candidate probes are generated as 60-mers, some of these probes were subsequently shortened to anywhere from 30 to 59 bases in later steps. The probe tiling procedure initiates a new probe at the first non-repeat-masked base within a target sequence. If a repeat-masked section is encountered, the tiling procedure is restarted at the next non-repeat-masked base. The tiling process was carried out for each chromosome of interest and allowed for the generation of candidate probes for each of the five chromosomes.
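The tiling procedure may be sketched in Python as shown below. The 60-base probe length and 30-base step are those of the example; the function name tile_probes and the assumption that unmasked sequence is upper-case A, C, G or T (with anything else treated as repeat-masked) are illustrative choices, not details from the specification.

```python
import re


def tile_probes(target, probe_len=60, step=30):
    """Tile fixed-length probes across each unmasked run of a target sequence."""
    probes = []
    # Each run of upper-case A/C/G/T is treated as one non-repeat-masked stretch.
    for run in re.finditer(r"[ACGT]+", target):
        start = run.start()  # tiling restarts at the first unmasked base
        while start + probe_len <= run.end():
            probes.append((start, target[start:start + probe_len]))
            start += step
    return probes


target = "A" * 120 + "n" * 10 + "C" * 60
print([start for start, _ in tile_probes(target)])
```

Because adjacent probes overlap by half their length, every unmasked base that is far enough from a masked boundary is covered by two candidate probes.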

Thermodynamic Annotation of Candidate Probes: After the generation of candidate probes, probes were annotated with their estimated duplex T_(m), the melting temperature of a probe forming a duplex with its perfect-match target sequence, in silico. This quantity is useful as an indication of the GC-content of the probe and a measure of the duplex stability.

Annotation of probes for expression and association with genes: After annotating the candidate probes by duplex T_(m), a number of databases were used to determine probe sequences which are contained within introns or exons of either known genes or predicted genes. Message alignments to genomic sequences were identified, and probes were analyzed for whether they were either partially or wholly contained within the confines of the exons of the messages. Probes were annotated independently for whether they are in intron regions or within the bounds of a message or within a “premRNA”.

Exonic Probe Sets: The candidate probes were then filtered for expression and duplex T_(m). Under our experimental conditions, higher duplex T_(m) probes were found to behave quite poorly relative to lower T_(m) probes. Very low T_(m) probes form unstable duplexes and hence have much lower signals. As a result, those probes with duplex T_(m)s below about 65 degrees and above 90 degrees were filtered out. The median T_(m) for all probes was between 79 and 80 degrees. After the T_(m) filtering, candidate probes whose center or middle nucleotide bases lie within the limits of any exon of known and predicted genes and mRNAs were considered.
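The two criteria used to build the exonic probe set may be combined in a short Python sketch. The 65 and 90 degree limits are those given above; the function name passes_exonic_filter, the representation of exons as half-open (start, end) coordinate pairs, and the choice of start + length // 2 as the center base are assumptions made for illustration.

```python
def passes_exonic_filter(start, length, tm, exons, tm_min=65.0, tm_max=90.0):
    """Keep a probe if its Tm is in range and its center base lies in an exon."""
    if not (tm_min <= tm <= tm_max):
        return False
    center = start + length // 2
    # Exons are assumed to be half-open intervals [exon_start, exon_end).
    return any(exon_start <= center < exon_end for exon_start, exon_end in exons)


exons = [(100, 200), (500, 650)]
print(passes_exonic_filter(start=80, length=60, tm=80.0, exons=exons))
```

A probe is therefore retained even when it only partially overlaps an exon, provided its middle base falls inside the exon boundaries.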

A homology search was completed for the remaining filtered candidate probes in expressed regions, thus producing an exonic ProbeSet. These homology search results were integrated with other probe parameters for annotation of the exonic ProbeSet.

Intronic Probe Sets: Candidate probes which were in introns of known genes, mRNAs and predicted genes (the intronic ProbeSet) were selected, but at a somewhat lower density than was used for the exonic ProbeSet. This step involves the use of the pairwise selection algorithm, which analyzes neighboring probes and selects/scores one probe over the other by analyzing the neighboring probes for a certain probe parameter.

The size of the intronic ProbeSet was chosen so that the homology search could run in about 2 days per chromosome on a single CPU. A homology search was run in the expressed regions for the intronic ProbeSet. The homology search results were integrated with other probe parameters for annotation of the intronic ProbeSet by those parameters.

Uniformly-spaced Probe Set: Another probe set was produced in which the number of probes is reduced from the broad candidate set by eliminating probes within known genes and thinning to create a fairly low-density probe set that covers the whole genome. This probe set was chosen to be of such a size that each chromosome could be homology searched in less than 2 CPU days. After pairwise filtering to remove closely spaced candidates, a homology search was conducted and the search results were integrated with other probe parameters for annotation of the intergenic Probe Set.

Gene by Gene Probe selection: The final probe set for the whole human genome array will comprise a plurality of probes for each known human gene. To accomplish this, the following events were carried out. The homology searched exonic, intronic and intergenic probe sets were combined, and these probe sets were filtered for homology, keeping only those probes which were 60-mers. The 60-mer probes were annotated for membership in either of two public transcript databases: knownGene and RefSeq. The messages (i.e. transcript sequences) were analyzed, and for each message a set of candidate probes was selected. If the message spanned a larger distance on the genome (say more than 200,000 bp), then about 6 or 7 candidate probes were selected. For shorter messages, the candidate probe set only contained 3 to 5 probes. If there were more probes within the boundaries of the gene (or message) than the target numbers from 4 to 7, the probes were thinned to the desired number by using the biased pairwise probe filtering selection within the genomic boundaries (as described above) in a gene-by-gene fashion. These numbers were used to create a 6-array set of single density arrays (about 20,000 probes per array). These were subsequently pared down (using indirect validation methods) to 40,000 probes for a double-density array, reducing the number of probes to one per gene and three per gene believed to be associated with cancer genomics. The remaining real estate on the array was filled with appropriately pairwise filtered probes from the uniformly-spaced intergenic probe set discussed above.

In this design process, the bias was towards exonic regions over intronic, with a T_(m) bias of 5 degrees from the desired optimum T_(m). If the duplex T_(m) of the exonic probe was more than 5° C. further from the optimum T_(m) than that of its intronic neighbor, the intronic probe was selected; otherwise the probe within the exon was chosen. Across the genome this bias increased the likelihood of picking probes within exons from about 3% to about 50%.
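The biased decision rule for an exonic probe and its intronic neighbor may be written as a small Python function. The 5 degree bias is that of the example; the function name choose_biased and the default optimum of 80 degrees (the target temperature mentioned elsewhere in this description) are assumptions for illustration.

```python
def choose_biased(exonic_tm, intronic_tm, optimum_tm=80.0, bias=5.0):
    """Prefer the exonic probe unless its Tm is more than `bias` degrees worse."""
    exonic_deviation = abs(exonic_tm - optimum_tm)
    intronic_deviation = abs(intronic_tm - optimum_tm)
    if exonic_deviation - intronic_deviation > bias:
        return "intronic"
    return "exonic"


print(choose_biased(exonic_tm=84.0, intronic_tm=80.0))   # within the bias
print(choose_biased(exonic_tm=86.0, intronic_tm=80.5))   # beyond the bias
```

With no bias (bias=0) the rule reduces to the unbiased pairwise comparison, in which the probe closer to the optimum T_(m) is always kept.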

In general, if the desired number of candidate probes was equal to the number of filtered probes in the given region, then they were simply selected. If there was a lesser number in the gene region, then the region-of-interest was expanded into flanking regions in steps of 2,000 bp, until the desired number of candidate probes was reached, or until a maximum distance of 20 kbp was analyzed, at which point the gene was flagged as “missing”. A more exhaustive search for probes was subsequently carried out for missing genes as described below.
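The stepwise expansion into flanking regions may be sketched in Python as follows. The 2,000 bp step is taken from the example, and the 20,000 bp limit reflects one reading of the maximum distance stated there; the function name select_with_expansion, the use of single probe coordinates, and the symmetric expansion on both sides of the gene are assumptions.

```python
def select_with_expansion(probe_positions, gene_start, gene_end, wanted,
                          step=2_000, max_flank=20_000):
    """Widen the search window until enough probes are found or the limit is hit.

    Returns (probes_found, missing), where missing is True when the gene
    would be flagged as "missing" for the later, more exhaustive search.
    """
    flank = 0
    while True:
        low, high = gene_start - flank, gene_end + flank
        hits = [p for p in probe_positions if low <= p <= high]
        if len(hits) >= wanted:
            return hits, False
        if flank >= max_flank:
            return hits, True
        flank += step


positions = [1_000, 5_500, 9_000, 40_000]
print(select_with_expansion(positions, 5_000, 6_000, wanted=2))
```

When the window yields more probes than wanted, the surplus would then be thinned by the biased pairwise filtering rather than truncated arbitrarily.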

There are two main reasons that good probes may not be found within a given gene, message, or region. The first is that the region may be exceptionally GC rich or high in duplex melting temperature (and in secondary structure). For example, it has been determined that expressed regions, such as introns, have substantially higher melting temperatures than unexpressed regions. The second reason that good probes may not be found within a given gene, message, or region is that the region may be highly homologous to one or more other regions of the genome. A good example of this is the pseudo-autosomal regions of the X and Y chromosomes. These regions, from 0 to 2.5 Mbp and from 87 to 91 Mbp on chrX, are duplicated on chrY. Probes selected in this gene-by-gene process were joined into a new table of “semi-final probes” for placing on an array for further validation and testing.

Iteration for Missing Genes: For those genes or messages flagged as “missing” in the previous step, the homology search was intensified/expanded to help identify these genes. The original unfiltered candidate probe set generated above was revisited. For each of the genes or messages missed in the first round, 400 new candidate probes were selected, upon which to perform a new homology search. These candidate probes were chosen first within the gene (or message) region of interest, and then in flanking regions of the gene in steps of several thousand base pairs. Then a homology search was carried out on the newly identified probes. The homology search results were joined with other probe set annotation, and then these probe sets were combined with the exonic, intronic and intergenic probe sets.

The stringency of the homology filter was reduced to HomLogS2B for those probes with a homology score of less than 10 (e.g. an equivalent distance of 10 bp). The amount by which the homology filter was relaxed was determined from the good validation results obtained for probes with homology scores in the range of 10-20. Probes were searched as stated above, but this time the search was expanded to a maximum limit of 400,000 bp beyond the limit of the gene of interest.

“Wasteland” probes: The selection of probes at a more reduced density between known and predicted genes, the intergenic ProbeSet, is sometimes called the “wasteland” probe set. Genes are not uniformly spaced across the genome or within any given chromosome; consequently, despite a major interest in selecting probes for genes, or gene-regions, it is valuable to have some low-density probe coverage for the intergenic regions. These intergenic regions may be of interest either because of their regulatory value or for purely exploratory research purposes. For this reason, a number of candidate probes (300-800/chromosome) were reserved for intergenic regions of each chromosome, depending on the length of the chromosome. These regions include but are not limited to predicted genes as well as mRNA not yet associated with known genes. Of course, arrays for location analysis are specifically designed to focus on intergenic regions, especially those regions just upstream of a gene's 5′-end.

The process for selecting wasteland probes is described below. The uniformly-spaced probe set produced above is filtered for probes with a HomLogS2B score greater than 19. Any previous expression annotation is cleared and the filtered probe set is “reannotated” for membership in either of the two public transcript databases, knownGene and/or RefSeq. If the probes are within these known genes, they are removed from the wasteland ProbeSet. The remaining probes are then analyzed and annotated for mRNAs and predicted genes.

The probe set is then thinned to approximately 300 to 800 probes from about 130,000 filtered probes initially using the biased pairwise analysis. The biased pairwise analysis was based on the length of the chromosome, while biasing probe selection toward a central T_(m) of 80 degrees as well as towards expression (e.g. biased toward mRNAs and predicted genes) with an additive bias of 5 degrees Celsius.
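A minimal sketch of such a biased pairwise thinning is given below for illustration. It assumes that the closest neighboring pair is examined at each iteration and the less desirable member removed, and it interprets the additive bias as a 5 degree credit subtracted from the T_(m) deviation of an expressed probe; that interpretation, the `Probe` record, and the names `penalty` and `thin_pairwise` are assumptions of this sketch rather than details stated in the description.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Probe:
    # Hypothetical record; field names are illustrative only.
    pos: int                  # genomic start position, bp
    tm: float                 # calculated duplex T_m, degrees C
    expressed: bool = False   # within an mRNA or predicted gene


def penalty(p: Probe, center_tm: float = 80.0, bias: float = 5.0) -> float:
    """Lower is better: T_m deviation, less a credit for expression."""
    dev = abs(p.tm - center_tm)
    return dev - bias if p.expressed else dev


def thin_pairwise(probes: List[Probe], target: int) -> List[Probe]:
    """Remove one member of the closest pair until `target` remain."""
    kept = sorted(probes, key=lambda p: p.pos)
    while len(kept) > max(target, 1):
        # Index of the neighboring pair separated by the smallest distance.
        i = min(range(len(kept) - 1),
                key=lambda k: kept[k + 1].pos - kept[k].pos)
        a, b = kept[i], kept[i + 1]
        loser = i + 1 if penalty(a) <= penalty(b) else i
        del kept[loser]
    return kept
```

Because the closest pair is always thinned first, the surviving probes tend toward uniform spacing while the penalty decides which member of each pair is retained. This simple version rescans all gaps on every pass, which is adequate for illustration but not tuned for sets of about 130,000 probes.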

Probes for pseudo-autosomal regions of Chromosomes X and Y: It is well known among genomic biologists that there are regions on chromosome Y that are virtually identical to chromosome X. These regions are called pseudo-autosomal despite the fact that they exist on very different chromosomes, because they contain the same genes, and these genes nominally manifest themselves in two copies per cell, independent of sex. For this reason, these regions were represented on a prototype array. Pseudo-autosomal regions are found mainly in two places in the human genome; the first place is about 2.5 megabases of both ChrX and ChrY, and the second is from 87-91 Mbp on chrX. 300 probes were selected to span this region more or less uniformly and biased towards expression with the pairwise algorithm. Out of the 300 probes, 53 probes were selected within exons, and 215 in introns.

Finalizing the probe data set: The probe data set was finalized by collecting all of the identified probe sets mentioned above and evaluating them as follows. Firstly, the gene-by-gene probe set, the missing gene probe-set, and the wasteland probe-set described for all chromosomes were combined into a single table. The prior expression annotation was reset (cleared) and the probes were reannotated with gene names in reverse order of database validation: predicted first, then mRNAs, MGCgenes, knownGene, and RefSeq. The genes on the forward strand as well as the reverse strand are kept for each probe.

For probes that are within exons on the reverse strand but not within exons on the forward strand, the complement of the reverse strand sequence was taken. This allows one array to be used for both gene-expression as well as for CGH. The probes within a message may still be viable for expression arrays, although their distances from the 3′ ends of the message will vary.

60-mer probes were trimmed or shortened in length as necessary to achieve a T_(m) below 81 degrees. Probe performance for 60-mer probes is dramatically better for probes with low T_(m) (below about 80 degrees) than probes with a higher T_(m). It has been demonstrated empirically that the performance of high T_(m) 60-mer probes can be improved somewhat by shortening them.

The trimming process for each high T_(m) 60-mer probe (typically chosen in an expressed region) involves searching each probe sequence for the longest sequential subsequence with a melting temperature less than a cutoff, typically 81 degrees, and utilizing that subsequence instead of the full length 60-mer sequence.
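The trimming search may be sketched as follows. The sketch is illustrative only: the description does not specify how the duplex T_(m) is calculated, so a widely used GC-content approximation (`basic_tm`) is substituted here as a stand-in, and the minimum length of 30 bases and the function name `trim_probe` are likewise assumptions. Any other T_(m) calculator can be supplied through `tm_func`.

```python
from typing import Callable, Optional


def basic_tm(seq: str) -> float:
    """Stand-in T_m estimate from GC content (not the patent's method)."""
    gc = seq.count("G") + seq.count("C")
    return 64.9 + 41.0 * (gc - 16.4) / len(seq)


def trim_probe(seq: str,
               tm_func: Callable[[str], float] = basic_tm,
               cutoff: float = 81.0,
               min_len: int = 30) -> Optional[str]:
    """Return the longest contiguous subsequence with T_m below cutoff.

    Lengths are tried from the full probe downward, so the first hit is
    the longest; the leftmost window wins among equal lengths. Returns
    None if no window of at least `min_len` bases passes.
    """
    seq = seq.upper()
    for length in range(len(seq), min_len - 1, -1):
        for start in range(len(seq) - length + 1):
            sub = seq[start:start + length]
            if tm_func(sub) < cutoff:
                return sub
    return None
```

A probe already below the cutoff is returned unchanged, since the full-length window is tested first.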

Array Layout: Finally, the selected probes for each chromosome were placed on an array. This was completed by exporting the selected probes to a text file in table form in a 4-field format and loading it into Agilent's Array Wizard internal software for generating the new array design.

Example II

In this example, validation experiments were completed using male and female samples, and cell lines containing non-naturally occurring copy numbers of X, such as 3X, 4X and 5X. The experiments were conducted using an array specifically designed for CGH, designated D4CGH1, which included over 2000 probes designed specifically for chromosome X. These candidate probes were made in three lengths: 60-mers, 45-mers and 30-mers. Male and female samples were hybridized on these arrays in replicate experiments. From these data, the slope (i.e. the separation between the distributions of X-chromosome probes, which have a nominal log2 ratio of −1 for male vs. female samples, and the autosomal probes, which have a nominal log2 ratio of 0), and noise measurements (i.e. the widths of the distributions of X-chromosome probes and of autosomal probes) were calculated, and subsequently, rules for determining the viability of probes in silico were established by this experiment as well as others.
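For illustration, the slope and noise measurements may be computed along the following lines. This sketch assumes that the separation is taken between the means of the two distributions and that the widths are sample standard deviations; the description does not state which estimators were used, and the function name `separation_and_noise` is hypothetical.

```python
from statistics import mean, stdev
from typing import Dict, Sequence


def separation_and_noise(chrx_log2: Sequence[float],
                         autosome_log2: Sequence[float],
                         nominal: float = -1.0) -> Dict[str, float]:
    """Summarize a male vs. female hybridization.

    `slope` is the observed separation between the X-chromosome and
    autosomal log2 ratio distributions, expressed as a fraction of the
    nominal separation (-1 for one vs. two copies of X).
    """
    separation = mean(chrx_log2) - mean(autosome_log2)
    return {
        "slope": separation / nominal,
        "noise_x": stdev(chrx_log2),
        "noise_autosome": stdev(autosome_log2),
    }
```

Under these assumptions a slope near 1.0 indicates that the X-chromosome probes respond fully to the copy number difference, while the two noise values describe the widths of the distributions.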

Probes that were determined by the homology search to be less than ideal were intentionally placed on the array as controls. Also included on the array were probes having various duplex Tm's and hairpin stabilities, to provide a broad range of duplex melting temperatures and hairpin stabilities. As a result, duplex Tm's and hairpin stabilities could be empirically determined, and these two probe properties could be used as metrics for predicting the performance of our probes in future array designs.

To see the effectiveness of several other probe properties, we plotted smoothed curves representing the slopes and p-values of the probes as a function of various parameters as shown in FIGS. 16 and 17.

Here, the metric utilized and determined on a probe-by-probe basis is the slope (which is a measure of the response of the LogRatio to chrX genomic copy number for the target) of each probe. It is expected that for good probes higher slopes will be obtained (ideally approaching 1.0) and conversely lower slopes indicate poor performing probes. In FIG. 16 a, each point on the curve represents the mean of a narrow band (5%) of probes, calculated as a moving average over the full range of the parameter under test. The y-axis represents the moving average of the slopes (calculated as the log-ratio of dye-normalized signals of the male sample to the female sample), whereas the X-axis points represent the moving average of the log (in base 2) of the background-subtracted signals for the probes in the reference channel, where the reference sample was a normal female sample with two copies of the X-chromosome in each cell. For the background-subtracted signal of the reference channel, we can see that the slopes are fairly high for most of the lower half of the range and drop precipitously above a log2 value of 8, which is about 250 counts. The slopes are also decreased at the bottom of the signal distribution, probably as a result of the greater relative noise near the detection limit of the system. The reason for the poor response of the probes at the higher end of the signal range is likely related to the specificity of the probes. Even though these probes may have good homology scores, many of these poor performing high-signal probes are probably binding non-specifically to unintended targets. Results are shown for 60-mers in FIG. 16 a, similarly for 45-mers in FIG. 16 b, and for 30-mers in FIG. 16 c. Results are quite similar for the 45-mers and 60-mers with maximal slopes approaching 0.8, and substantially worse for 30-mers, which simply appear to have insufficient duplex stability under these hybridization conditions.
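The smoothing used for such curves can be sketched as a moving average over a band containing 5% of the probes, ordered by the parameter under test. The sketch below is illustrative; the function name `smoothed_curve` and the handling of the window edges (only full windows are reported) are assumptions and may differ from the procedure actually used to produce the figures.

```python
from typing import List, Sequence, Tuple


def smoothed_curve(x: Sequence[float], y: Sequence[float],
                   band: float = 0.05) -> List[Tuple[float, float]]:
    """Moving average of (x, y) over a band of probes sorted by x.

    Each output point is the mean x and mean y of a window holding
    `band` (default 5%) of the probes.
    """
    pairs = sorted(zip(x, y))
    n = len(pairs)
    w = max(1, int(round(n * band)))   # probes per window
    points = []
    for i in range(n - w + 1):
        window = pairs[i:i + w]
        mean_x = sum(p[0] for p in window) / w
        mean_y = sum(p[1] for p in window) / w
        points.append((mean_x, mean_y))
    return points
```

Here `x` would hold, for example, the log2 background-subtracted reference signal of each probe and `y` the corresponding slope.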

Equivalent curves are shown (without p-values) in FIGS. 17 a, b, and c for other metrics and probe properties. FIG. 17 a shows the relationship between the performance of probes and their duplex T_(m)'s. The plot shows a relatively better probe response for lower duplex T_(m) and poor response for probes with higher melting temperatures. Although there is improved probe performance below the temperature of 78 degrees, there are relatively low probe densities at these lower temperatures and a maximum of available probes at about 80 degrees. These curves are clearly dependent on the experimental conditions as well. The operating point of 80 degrees represents a trade off between probe performance and the density of available probes. The maximum slope of about 0.9 at the low end of the distributions indicates that melting temperature is probably a very effective in silico metric. FIG. 17 b shows the performance of probes as a function of persistence, corresponding to hybridization time-points measured at 1 hour and 24 hours. As in FIG. 17 a, the points on both axes are smoothed over 5% of the probes for each point to more effectively show the trends. This metric, though reasonably effective at finding some good probes (near the high end of the persistence distribution), is not very efficient at identifying relatively poor probes.

FIG. 17 c shows the performance of probes as a function of their HomLogS2B homology scores. Again, the points on both axes are smoothed over 5% of the probes for each point in the plot. As the large majority of the probes have a HomLogS2B>18, most of the probes are within the right-most region of the plot. Only the bottom 6% of the probes have a HomS2B score<5 and these probes demonstrate particularly low signal slopes and hence poor probe performance. Consequently, this metric is particularly effective in eliminating the relatively small number of probes with low HomLogS2B and poor specificity.

In conclusion, these experiments identified the calculated duplex T_(m) score and the homology scores as the most effective in silico predictors of good probe performance, and the signal intensity as the best empirical predictor that can be measured in simple experiments that can be performed with small numbers of normal samples, such as self-self experiments.

While the present invention has been described with reference to the specific embodiments thereof, it should be understood by those skilled in the art that various changes may be made and equivalents may be substituted without departing from the true spirit and scope of the invention. In addition, many modifications may be made to adapt a particular situation, material, composition of matter, process, process step or steps, to the objective, spirit and scope of the present invention. All such modifications are intended to be within the scope of the claims appended hereto.

1.-87. (canceled)
 88. A computer-implemented method for generating a set of probe nucleic acid sequences, the method comprising: (a) sorting a plurality of candidate probe nucleic acid sequences, for a genomic region of interest, from smallest genomic distance to largest genomic distance between neighboring candidate probe nucleic acid sequences to produce a sorted plurality of candidate probe nucleic acid sequences; (b) evaluating a probe parameter for a neighboring pair of candidate probe nucleic acid sequences from said sorted plurality to identify a first member of said neighboring pair with a more desirable probe parameter than a second pair member of said neighboring pair; (c) removing said second pair member from said plurality; (d) reiterating said sorting, evaluating and removing steps at least once to generate a set of probe nucleic acid sequences; and (e) outputting said set of probe nucleic acid sequences.
 89. The method according to claim 88, wherein said neighboring pair evaluated in step (b) is a pair that is closest to each other in terms of genomic distance in said sorted plurality.
 90. The method according to claim 89, wherein said probe parameter is an in silico probe parameter.
 91. The method of claim 90, wherein said in silico probe parameter is selected from a group consisting of duplex melting temperature, hairpin stability, GC content, probe is within an exon, probe is within a gene, probe is within an intron, probe is within an intergenic region, and homology score.
 92. The method according to claim 88, comprising synthesizing at least one probe nucleic acid having a sequence of a member probe nucleic acid sequence of said set of probe nucleic acid sequences.
 93. The method according to claim 92, wherein said method further comprises assaying said probe nucleic acid in a hybridization assay.
 94. The method according to claim 88, comprising generating a nucleic acid array comprising at least one probe nucleic acid having a sequence of a member probe nucleic acid sequence of said set of probe nucleic acid sequences.
 95. The method according to claim 94, wherein said method further comprises contacting said nucleic acid array with a genomic sample.
 96. The method according to claim 91, wherein evaluating comprises evaluating duplex melting temperature and wherein said first member has a lower duplex melting temperature than said second pair member.
 97. The method according to claim 91, wherein evaluating comprises evaluating homology score and wherein said first member has a higher homology score than said second pair member.
 98. The method according to claim 88, where said plurality of candidate probe nucleic acid sequences are generated by: (i) selecting target sequences from said genomic region of interest; (ii) repeat-masking said target sequences to form non-repeat masked regions; (iii) tiling sequences across said non-repeat masked regions to generate said candidate nucleic acid sequences; and (iv) screening said candidate probes nucleic acid sequences according to at least one in silico parameter.
 99. The method according to claim 98, comprising identifying restriction enzyme sites in the genomic region of interest, and selecting target sequences that exclude said restriction enzyme sites.